Sorting variables alphabetically in a sed - linux

In short, I have a bash script that finds all 5-letter words in the dictionary with a single duplicated letter. I'm using sed to print out the letter that repeats and the letters that don't. I have to sort the letters that don't repeat in alphabetical order, and I'm not quite sure how.
Here's my sed:
sed 's/\(.*\)\(.\)\(.*\)\2\(.*\)/\1\2\3\2\4 \2 \1\3\4 /'
So I need to sort \1\3\4, perhaps by piping them into a read loop.
UPDATE
grep '^[a-z][a-z][a-z][a-z][a-z]$' /usr/share/dict/words |
grep '.*\(.\).*\1.*' |
grep -v '.*\(.\).*\1.*\1.*' |
grep -v '.*\(.\).*\(.\).*\1.*\2.*\1.*\2.*' |
grep -v '.*\(.\).*\(.\).*\1.*\2.*\2.*\1.*' |
grep -v '.*\(.\).*\(.\).*\1.*\1.*\2.*\2.*' |
sed 's/\(.*\)\(.\)\(.*\)\2\(.*\)/\1\2\3\2\4 \2 \1\3\4/' |
while read word dup nondup
do sort -$nondup
front=$[nondup:1]
middle=$[nondup:2]
back=$[nondup:3]
echo $word $dup $front$middle$back
done

For a sample dictionary of 5-letter words:
$ cat file
timey
terra
debby
ovolt
spell
Now, using your sed command, let's sort the output by the non-repeating letters:
$ sed 's/\(.*\)\(.\)\(.*\)\2\(.*\)/\1\2\3\2\4 \2 \1\3\4 /' file | sort -k3
timey
debby b dey
spell l spe
terra r tea
ovolt o vlt
sort -k3 sorts on the third column.
As above, but also sorting the non-recurring letters alphabetically
This solution adds a shell while loop in order to sort the non-recurring letters:
sed 's/\(.*\)\(.\)\(.*\)\2\(.*\)/\1\2\3\2\4 \2 \1\3\4 /' file | while read word rep non
do
non=$(echo "$non" | grep -o . | sort | tr -d "\n")
echo "$word $rep $non"
done | sort -k3
On the same input, this produces the output:
timey
terra r aet
debby b dey
spell l eps
ovolt o ltv
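The character-sorting trick in that loop can be tried on its own; for example, the non-repeating letters of ovolt come back sorted:

```shell
# sort the characters of a short string alphabetically:
# split into one character per line, sort, then strip the newlines
non="vlt"
sorted=$(echo "$non" | grep -o . | sort | tr -d "\n")
echo "$sorted"
# → ltv
```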
Primitive Method Suitable Only For Instructors
If I understand correctly, your instructor wants something like this:
sed 's/\(.*\)\(.\)\(.*\)\2\(.*\)/\1\2\3\2\4 \2 \1\3\4 /' file |
while read word rep non
do
[ "$non" ] || continue # skip any word that lacks a repeating letter
front=${non:0:1}
middle=${non:1:1}
back=${non:2:1}
if [[ "$front" < "$middle" ]] && [[ "$front" < "$back" ]]
then
[[ "$middle" < "$back" ]] && non=$front$middle$back || non=$front$back$middle
elif [[ "$middle" < "$front" ]] && [[ "$middle" < "$back" ]]
then
[[ "$front" < "$back" ]] && non=$middle$front$back || non=$middle$back$front
elif [[ "$back" < "$front" ]] && [[ "$back" < "$middle" ]]
then
[[ "$front" < "$middle" ]] && non=$back$front$middle || non=$back$middle$front
else
echo "ERROR"
fi
echo "$word $rep $non"
done | sort -k3
This method requires bash.

In addition to John's answer: if you want only the remnants (the non-repeating letters) sorted, you can simply modify your sed command and pipe the result to sort:
sed -e 's/\(.*\)\(.\)\(.*\)\2\(.*\)/\1\3\4/' stack/dat/dicta.dat | sort
input:
$ cat stack/dat/dicta.dat
aback
abaft
abase
abash
abask
abate
output:
$ sed -e 's/\(.*\)\(.\)\(.*\)\2\(.*\)/\1\3\4/' stack/dat/dicta.dat | sort
bck
bft
bse
bsh
bsk
bte
If you want the full output sorted, then calling sort following your original sed with option -k3 is the correct way.

Related

How to count the number of numbers/letters in file?

I'm trying to count the number of numbers and letters in my file in Bash.
I know that I can use wc -c file to count the number of characters, but how can I restrict it to letters only, and then to numbers?
Here's a way completely avoiding pipes, just using tr and the shell's way to give the length of a variable with ${#variable}:
$ cat file
123 sdf
231 (3)
huh? 564
242 wr =!
$ NUMBERS=$(tr -dc '[:digit:]' < file)
$ LETTERS=$(tr -dc '[:alpha:]' < file)
$ ALNUM=$(tr -dc '[:alnum:]' < file)
$ echo ${#NUMBERS} ${#LETTERS} ${#ALNUM}
13 8 21
To count the number of letters and numbers you can combine grep with wc (quote the patterns so the shell doesn't glob them, and use wc -l, since grep -o prints one match per line):
grep -o '[a-z]' myfile | wc -l
grep -o '[0-9]' myfile | wc -l
With a little bit of tweaking you can modify it to count numbers, alphabetic words, or alphanumeric words like this (-E enables the + repetition operator):
grep -oE '[a-z]+' myfile | wc -l
grep -oE '[0-9]+' myfile | wc -l
grep -oE '[[:alnum:]]+' myfile | wc -l
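As a sanity check, here are the single-character counting pipelines run against the four-line sample file from the tr answer above (recreated so the snippet is self-contained):

```shell
# recreate the four-line sample file shown in the tr answer above
printf '123 sdf\n231 (3)\nhuh? 564\n242 wr =!\n' > sample.txt

# grep -o prints one match per line, so wc -l counts the matches;
# the $(( )) wrapper strips any padding some wc versions emit
letters=$(( $(grep -o '[a-z]' sample.txt | wc -l) ))
digits=$(( $(grep -o '[0-9]' sample.txt | wc -l) ))
echo "$letters letters, $digits digits"
# → 8 letters, 13 digits
```

This agrees with the tr-based answer above (8 letters, 13 digits).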
You can use sed to replace all characters that are not of the kind that you are looking for and then word count the characters of the result.
# 1h;1!H will place all lines into the buffer that way you can replace
# newline characters
sed -n '1h;1!H;${;g;s/[^a-zA-Z]//g;p;}' myfile | wc -c
It's easy enough to just do numbers as well.
sed -n '1h;1!H;${;g;s/[^0-9]//g;p;}' myfile | wc -c
Or why not both.
sed -n '1h;1!H;${;g;s/[^0-9a-zA-Z]//g;p;}' myfile | wc -c
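One caveat shared by all three sed pipelines: the p command emits a trailing newline, which wc -c also counts. On the sample file from the tr answer above, the combined version therefore reports 22 rather than the 21 alphanumeric characters:

```shell
# recreate the sample file used earlier in this thread
printf '123 sdf\n231 (3)\nhuh? 564\n242 wr =!\n' > sample.txt

# 21 alphanumeric characters, plus the newline that p prints
sed -n '1h;1!H;${;g;s/[^0-9a-zA-Z]//g;p;}' sample.txt | wc -c
# → 22
```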
There are a number of ways to approach analyzing the line, word, and character frequency of a text file in bash. Utilizing the POSIX character class filters (e.g. [:upper:], and so on), you can drill down to the frequency of each character type in a text file. Below is a simple script that reads from stdin, provides the normal wc output as its first line of output, and then outputs the number of upper, lower, digit, punct and whitespace characters.
#!/bin/bash
declare -i lines=0
declare -i words=0
declare -i chars=0
declare -i upper=0
declare -i lower=0
declare -i digit=0
declare -i punct=0
oifs="$IFS"
# Read line with new IFS, preserve whitespace
while IFS=$'\n' read -r line; do
# parse line into words with original IFS
IFS=$oifs
set -- $line
IFS=$'\n'
# Add up lines, words, chars, upper, lower, digit
lines=$((lines + 1))
words=$((words + $#))
chars=$((chars + ${#line} + 1))
for ((i = 0; i < ${#line}; i++)); do
[[ ${line:$((i)):1} =~ [[:upper:]] ]] && ((upper++))
[[ ${line:$((i)):1} =~ [[:lower:]] ]] && ((lower++))
[[ ${line:$((i)):1} =~ [[:digit:]] ]] && ((digit++))
[[ ${line:$((i)):1} =~ [[:punct:]] ]] && ((punct++))
done
done
echo " $lines $words $chars"
echo " upper: $upper, lower: $lower, digit: $digit, punct: $punct, \
whitespace: $((chars-upper-lower-digit-punct))"
Test Input
$ cat dat/captnjackn.txt
This is a tale
Of Captain Jack Sparrow
A Pirate So Brave
On the Seven Seas.
(along with 2357 other pirates)
Example Use/Output
$ bash wcount3.sh <dat/captnjackn.txt
5 21 108
upper: 12, lower: 68, digit: 4, punct: 3, whitespace: 21
You can customize the script to give you as little or as much detail as you like. Let me know if you have any questions.
You can use tr to preserve only alphanumeric characters by combining the -c (complement) and -d (delete) flags. From there on, it's just a question of some piping:
$ cat myfile.txt | tr -cd '[:alnum:]' | wc -c

Show uncommon part of the line

Hi, I have two files which contain paths. I want to compare the two files and show only the uncommon part of each line.
1.txt:
/home/folder_name/abc
2.txt:
/home/folder_name/abc/pqr/xyz/mnp
Output I want:
/pqr/xyz/mnp
How can I do this?
This bit of awk does the job:
$ awk 'NR==FNR {a[++i]=$0; next}
{
b[++j]=$0;
if(length(a[j])>length(b[j])) {t=a[j]; a[j]=b[j]; b[j]=t}
sub(a[j],"",b[j]);
print b[j]
}' 2.txt 1.txt # or 1.txt 2.txt, it doesn't matter
Write the line from the first file to the array a.
Write the line from the second to b.
Swap a[j] and b[j] if a[j] is longer than b[j] (this might not be necessary if the longer text is always in b).
Remove the part found in a[j] from b[j] and print b[j].
This is a general solution; it makes no assumption that the match is at the start of the line, or that the contents of one file's line should be removed from the other. If you can afford to make those assumptions, the script can be simplified.
If the match may occur more than once on the line, you can use gsub rather than sub to perform a global substitution.
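For the question's data those assumptions do hold (each line of 1.txt is a literal prefix of the corresponding line of 2.txt), so a simplified sketch could look like this; note sub() still treats the stored line as a regex, the same caveat as above:

```shell
# recreate the question's two files
printf '/home/folder_name/abc\n' > 1.txt
printf '/home/folder_name/abc/pqr/xyz/mnp\n' > 2.txt

# store each line of the first file, then delete that (regex!)
# prefix from the corresponding line of the second file
awk 'NR==FNR {a[FNR] = $0; next} {sub(a[FNR], ""); print}' 1.txt 2.txt
# → /pqr/xyz/mnp
```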
Assuming you have the strings in 1.txt and 2.txt, the following code will do:
paste 1.txt 2.txt |
while read a b;
do
if [[ ${#a} -gt ${#b} ]];
then
echo ${a/$b};
else
echo ${b/$a};
fi;
done;
This is how it works on my system,
shiplu#:~/test/bash$ cat 1.txt
/home/shiplu/test/bash
/home/shiplu/test/bash/hello/world
shiplu#:~/test/bash$ cat 2.txt
/home/shiplu/test/bash/good/world
/home/shiplu/test/bash
shiplu#:~/test/bash$ paste 1.txt 2.txt |
> while read a b;
> do
> if [[ ${#a} -gt ${#b} ]];
> then
> echo ${a/$b};
> else
> echo ${b/$a};
> fi;
> done;
/good/world
/hello/world
This script will compare all lines in the files and only output the change in each line.
First it counts the number of lines in the first file.
Then I start a loop that will iterate once per line.
Declare two variables that hold the same line from both files.
Compare the lines and, if they are the same, output that they are.
If they are not, replace the duplicated part of the string with nothing (effectively removing it).
I used : as the separator in sed because your variables contain /. So if they can contain : you may want to consider changing the delimiter.
Probably not the most efficient solution, but it works.
#!/bin/bash
NUMOFLINES=$(wc -l < "1.txt")
echo $NUMOFLINES
for ((i = 1 ; i <= $NUMOFLINES ; i++)); do
f1=$(sed -n $i'p' 1.txt)
f2=$(sed -n $i'p' 2.txt)
if [[ $f1 < $f2 ]]; then
echo -n "Line $i:"
sed 's:'"$f1"'::' <<< "$f2"
elif [[ $f1 > $f2 ]]; then
echo -n "Line $i:"
sed 's:'"$f2"'::' <<< "$f1"
else
echo "Line $i: Both lines are the same"
fi
echo ""
done
If you happen to use bash, you could try this one:
echo $(diff <(grep -o . 1.txt) <(grep -o . 2.txt) \
| sed -n '/^[<>]/ {s/^..//;p}' | tr -d '\n')
It does a character-by-character comparison using diff (where grep -o . gives an intermediate line for each character to be fed to line-wise diff), and just prints the differences (intermediate diff output lines starting with markers < or > omitted, then joining lines with tr).
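A quick check of this approach on the question's files (recreated below; process substitution requires bash, as noted):

```shell
# recreate the question's two single-line files
printf '/home/folder_name/abc\n' > 1.txt
printf '/home/folder_name/abc/pqr/xyz/mnp\n' > 2.txt

# diff the per-character streams, keep only the changed characters,
# and glue them back together
result=$(diff <(grep -o . 1.txt) <(grep -o . 2.txt) |
         sed -n '/^[<>]/ {s/^..//;p}' | tr -d '\n')
echo "$result"
# → /pqr/xyz/mnp
```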
If you have multiple lines in your input (which you did not mention in your question) then try something like this (where % is a character not contained in your input):
diff <(cat 1.txt | tr '\n' '%' | grep -o .) \
<(cat 2.txt | tr '\n' '%' | sed -e 's/%/%%/g' | grep -o .) \
| sed -n '/^[<>]/ {s/^..//;p}' | tr -d '\n' | tr '%' '\n'
This extends the single-line solution by adding line end markers (e.g. %) which diff is forced to include in its output by adding % on the left and %% on the right.
If both the files have always a single line in each, then below works:
perl -lne '$a=$_ if($.==1);print $1 if(/$a(.*)/ && $.==2)' 1.txt 2.txt
Tested Below:
> cat 1.txt
/home/folder_name/abc
> cat 2.txt
/home/folder_name/abc/pqr/xyz/mnp
> perl -lne '$a=$_ if($.==1);print $1 if(/$a(.*)/ && $.==2)' 1.txt 2.txt
/pqr/xyz/mnp
>

Calculate Word occurrences from file in bash

I'm sorry for the very noob question, but I'm kind of new to bash programming (started a few days ago). Basically what I want to do is keep one file with all the word occurrences of another file.
I know I can do this:
sort | uniq -c | sort
The thing is that after that I want to take a second file, calculate the occurrences again and update the first one. Then I take a third file, and so on.
What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow.
I'm pretty sure there is a very efficient way just with a command or so, using uniq, but I can't figure it out.
Could you please lead me to the right way?
I'm also pasting the code I wrote:
#!/bin/bash
# count the number of word occurrences from a file and writes to another file #
# the words are listed from the most frequent to the less one #
touch .check # used to check the occurrences. Temporary file
touch distribution.txt # final file with all the occurrences calculated
page=$1 # contains the file I'm calculating
occurrences=$2 # temporary file for the occurrences
# takes all the words from the file $page and orders them by occurrences
cat $page | tr -cs A-Za-z\' '\n'| tr A-Z a-z > .check
# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read words
do
word=${words} # word I'm calculating
strlen=${#word} # word's length
# I use a blacklist to skip banned words (for example very short ones or unimportant words, like articles and prepositions)
if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
then
# if the word was never found before it writes it with 1 occurrence
if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
then
echo "$word: 1" | cat >> $occurrences
# else it calculates the occurrences
else
old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
let "new=old+1"
sed -i "s/^$word: $old$/$word: $new/g" $occurrences
fi
fi
done
rm .check
# finally it orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt
Well, I'm not sure that I've got the point of the thing you are trying to do,
but I would do it this way:
while read file
do
cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
done < file-list
Now you have statistics for all your files, and you simply aggregate them:
while read file
do
cat stat.$file
done < file-list \
| sort -k2 \
| awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}'
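The same aggregation can also be done in a single awk pass with an associative array, avoiding the prev-tracking state machine. A sketch with two made-up stat files in uniq -c's "count word" format:

```shell
# two hypothetical per-file statistics in uniq -c's "count word" format
printf '  3 the\n  2 cat\n' > stat.a
printf '  5 the\n  1 dog\n' > stat.b

# sum the counts per word across all stat files in one awk pass
cat stat.a stat.b |
awk '{sum[$2] += $1} END {for (w in sum) print sum[w], w}' |
sort -rn
# → 8 the, 2 cat, 1 dog (one per line)
```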
Example of usage:
$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF
$ while read file; do
> cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
> done < file-list
$ while read file
> do
> cat stat.$file
> done < file-list \
> | sort -k2 \
> | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head
3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell

sorting a "key/value pair" array in bash

How do I sort a "python dictionary-style" array e.g. ( "A: 2" "B: 3" "C: 1" ) in bash by the value? I think this code snippet will make my question a bit clearer.
State="Total 4 0 1 1 2 0 0"
W=$(echo $State | awk '{print $3}')
C=$(echo $State | awk '{print $4}')
U=$(echo $State | awk '{print $5}')
M=$(echo $State | awk '{print $6}')
WCUM=( "Owner: $W;" "Claimed: $C;" "Unclaimed: $U;" "Matched: $M" )
echo ${WCUM[@]}
This will simply print the array: Owner: 0; Claimed: 1; Unclaimed: 1; Matched: 2
How do I sort the array (or the output), eliminating any pair with "0" value, so that the result like this:
Matched: 2; Claimed: 1; Unclaimed: 1
Thanks in advance for any help or suggestions. Cheers!!
Quick and dirty idea would be (this just sorts the output, not the array):
echo ${WCUM[@]} | sed -e 's/; /;\n/g' | awk -F: '!/ 0;?/ {print $0}' | sort -t: -k 2 -r | xargs
echo -e ${WCUM[@]} | tr ';' '\n' | sort -r -k2 | egrep -v ": 0$"
Sorting and filtering are independent steps, so if you only want to filter out 0 values, it is much easier.
Append an
| tr '\n' ';'
to get it to a single line again in the end.
nonull=$(for n in ${!WCUM[@]}; do echo ${WCUM[n]} | egrep -v ": 0;"; done | tr -d "\n")
I don't see a good reason to end $W $C $U with a semicolon but not $M, so instead of adapting my code to this distinction I would eliminate the special case. If that's not possible, I would append a semicolon temporarily to $M and remove it at the end.
Another attempt, using some of the bash features, but still needs sort, that is crucial:
#! /bin/bash
State="Total 4 1 0 4 2 0 0"
string=$State
for i in 1 2 ; do # remove unnecessary fields
string=${string#* }
string=${string% *}
done
# Insert labels
string=Owner:${string/ /;Claimed:}
string=${string/ /;Unclaimed:}
string=${string/ /;Matched:}
# Remove zeros
string=(${string[@]//;/; })
string=(${string[@]/*:0;/})
string=${string[@]}
# Format
string=${string//;/$'\n'}
string=${string//:/: }
# Sort
string=$(sort -t: -nk2 <<< "$string")
string=${string//$'\n'/;}
echo "$string"

How do I find common characters between two strings in bash?

For example:
s1="my_foo"
s2="not_my_bar"
the desired result would be my_o. How do I do this in bash?
My solution below uses fold to break the string into one character per line, sort to sort the lists, comm to compare the two strings, and finally tr to delete the newline characters:
comm -12 <(fold -w1 <<< $s1 | sort -u) <(fold -w1 <<< $s2 | sort -u) | tr -d '\n'
Alternatively, here is a pure Bash solution (which also maintains the order of the characters). It iterates over the first string and checks if each character is present in the second string.
s="temp_foo_bar"
t="temp_bar"
i=0
while [ $i -ne ${#s} ]
do
c=${s:$i:1}
if [[ $result != *$c* && $t == *$c* ]]
then
result=$result$c
fi
((i++))
done
echo $result
prints: temp_bar
Assuming the strings do not contain embedded newlines:
s1='my_foo' s2='my_bar'
intersect=$(
comm -12 <(
fold -w1 <<< "$s1" |
sort -u
) <(
fold -w1 <<< "$s2" |
sort -u
) |
tr -d \\n
)
printf '%s\n' "$intersect"
And another one:
tr -dc "$s2" <<< "$s1"
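Note that tr treats "$s2" as a set of characters, not a string, so duplicate characters in "$s1" are all kept (and characters like - or \ in the set would need escaping). With the question's strings this yields my_oo rather than the stated my_o:

```shell
s1="my_foo"
s2="not_my_bar"

# -c complements the set, -d deletes: keep only the characters of
# s1 that also occur somewhere in s2 (duplicates included)
tr -dc "$s2" <<< "$s1"
# → my_oo
```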
A late entry; I've just found this page:
echo "$str2" |
awk 'BEGIN{FS=""}
{ n=0; while(n<=NF) {
if ($n == substr(test,n,1)) { if(!found[$n]) printf("%c",$n); found[$n]=1;} n++;
} print ""}' test="$str1"
And another one; this one builds a regexp for matching (note: it doesn't work with special characters, but that's not hard to fix with another sed):
echo "$str1" |
grep -E -o ^`echo -n "$str2" | sed 's/\(.\)/(|\1/g'; echo "$str2" | sed 's/./)/g'`
Should be a portable solution:
s1="my_foo"
s2="my_bar"
while [ -n "$s1" -a -n "$s2" ]
do
if [ "${s1:0:1}" = "${s2:0:1}" ]
then
printf %s "${s1:0:1}"
else
break
fi
s1="${s1:1:${#s1}}"
s2="${s2:1:${#s2}}"
done
A solution using a single sed execution:
echo -e "$s1\n$s2" | sed -e 'N;s/^/\n/;:begin;s/\n\(.\)\(.*\)\n\(.*\)\1\(.*\)/\1\n\2\n\3\4/;t begin;s/\n.\(.*\)\n\(.*\)/\n\1\n\2/;t begin;s/\n\n.*//'
Like all cryptic sed scripts, it needs an explanation, here in the form of a sed script file that can be run with echo -e "$s1\n$s2" | sed -f script:
# Read the next line so s1 and s2 are in the pattern space only separated by a \n.
N
# Put a \n at the beginning of the pattern space.
s/^/\n/
# During the script execution, the pattern space will contain <result so far>\n<what left of s1>\n<what left of s2>.
:begin
# If the 1st char of s1 is found in s2, remove it from s1 and s2, append it to the result and do this again until it fails.
s/\n\(.\)\(.*\)\n\(.*\)\1\(.*\)/\1\n\2\n\3\4/
t begin
# When the previous substitution fails, remove the 1st char of s1 and try again to find the new 1st char of s1 in s2.
s/\n.\(.*\)\n\(.*\)/\n\1\n\2/
t begin
# When the previous substitution fails, s1 is empty, so remove the \n and what is left of s2.
s/\n\n.*//
If you want to remove duplicates, add the following at the end of the script:
:end;s/\(.\)\(.*\)\1/\1\2/;t end
Edit: I realize that dogbane's pure shell solution has the same algorithm, and is probably more efficient.
comm=""
for ((i=0;i<${#s1};i++))
do
if test ${s1:$i:1} = ${s2:$i:1}
then
comm=${comm}${s1:$i:1}
fi
done
Since everyone loves perl one-liners full of punctuation:
perl -e '$a{$_}++ for split "",shift; $b{$_}++ for split "",shift; for (sort keys %a){print if defined $b{$_}}' my_foo not_my_bar
Creates hashes %a and %b from the input strings.
Prints any characters common to both strings.
outputs:
_moy
"flower","flow","flight" --> output fl
s="flower"
t="flow"
i=0
while [ $i -ne ${#s} ]
do
c=${s:$i:1}
if [[ $result != *$c* && $t == *$c* ]]
then
result=$result$c
fi
((i++))
done
echo $result
p=$result
q="flight"
j=0
while [ $j -ne ${#p} ]
do
c1=${p:$j:1}
if [[ $result1 != *$c1* && $q == *$c1* ]]
then
result1=$result1$c1
fi
((j++))
done
echo $result1
