I have a string of several words in bash called comp_line, which can have any number of spaces inside. For example:
"foo bar apple banana q xy"
And I have a zero-based index comp_point pointing to one character in that string, e.g. if comp_point is 4, it points to the first 'b' in 'bar'.
Based on the comp_point and comp_line alone, I want to extract the word being pointed to by the index, where the "word" is a sequence of letters, numbers, punctuation or any other non-whitespace character, surrounded by whitespace on either side (if the word is at the start or end of the string, or is the only word in the string, it should work the same way.)
The word I'm trying to extract will become cur (the current word)
Based on this, I've come up with a set of rules:
Read the current character curchar, the previous character prevchar, and the next character nextchar. Then:
If curchar is a graph character (non-whitespace), set cur to the letters before and after curchar (stopping when you reach whitespace or the start/end of the string on either side).
Else, if prevchar is a graph character, set cur to the letters from the previous character, backwards until the previous whitespace character or the start of the string.
Else, if nextchar is a graph character, set cur to the letters from the next character, forwards until the next whitespace character or the end of the string.
If none of the above conditions are met (meaning curchar, prevchar and nextchar are all whitespace), set cur to "" (empty string).
I've written some code which I think achieves this. Rules 2, 3 and 4 are relatively straightforward, but rule 1 is the most difficult to implement - I've had to do some complicated string slicing. I'm not convinced that my solution is in any way ideal, and want to know if anyone knows of a better way to do this within bash only (not outsourcing to Python or another easier language.)
Tested on https://rextester.com/l/bash_online_compiler
#!/bin/bash
# GNU bash, version 4.4.20
comp_line="foo bar apple banana q xy"
comp_point=19
cur=""
curchar=${comp_line:$comp_point:1}
prevchar=${comp_line:$((comp_point - 1)):1}
nextchar=${comp_line:$((comp_point + 1)):1}
echo "<$prevchar> <$curchar> <$nextchar>"
if [[ $curchar =~ [[:graph:]] ]]; then
    # Rule 1 - Extract current word
    slice="${comp_line:$comp_point}"
    endslice="${slice%% *}"
    slice="${slice#"$endslice"}"
    slice="${comp_line%"$slice"}"
    cur="${slice##* }"
elif [[ $prevchar =~ [[:graph:]] ]]; then
    # Rule 2 - Extract previous word
    slice="${comp_line::$comp_point}"
    cur="${slice##* }"
elif [[ $nextchar =~ [[:graph:]] ]]; then
    # Rule 3 - Extract next word
    slice="${comp_line:$comp_point+1}"
    cur="${slice%% *}"
else
    # Rule 4 - Set cur to empty string ""
    cur=""
fi
echo "Cur: <$cur>"
echo "Cur: <$cur>"
The current example will return 'banana' as comp_point is set to 19.
I'm sure there must be a neater way to do it that I haven't thought of, or some trick that I've missed. It works so far, but there may be edge cases I haven't considered. Can anyone advise if there's a better way to do it?
(The XY problem, if anyone asks)
I'm writing a tab completion script, and trying to emulate the functionality of COMP_WORDS and COMP_CWORD, using COMP_LINE and COMP_POINT. When a user is typing a command to tab complete, I want to work out which word they are trying to tab complete just based on the latter two variables. I don't want to outsource this code to Python because performance takes a big hit when Python is involved in tab complete.
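For comparison, the four rules can be collapsed into one function by splitting the line at the index, which avoids the multi-step slicing of rule 1. This is only a sketch (word_at is a hypothetical name, not part of bash's completion machinery):

```shell
#!/bin/bash
# Sketch: the question's four rules, with rule 1 handled by splitting
# the line at the index instead of multi-step slicing.
word_at() {
    local line=$1 point=$2 left right
    if [[ ${line:point:1} =~ [[:graph:]] ]]; then                     # rule 1
        left=${line:0:point}; right=${line:point}
        echo "${left##* }${right%% *}"
    elif [[ point -gt 0 && ${line:point-1:1} =~ [[:graph:]] ]]; then  # rule 2
        left=${line:0:point}
        echo "${left##* }"
    elif [[ ${line:point+1:1} =~ [[:graph:]] ]]; then                 # rule 3
        right=${line:point+1}
        echo "${right%% *}"
    else                                                              # rule 4
        echo ""
    fi
}
word_at "foo bar apple banana q xy" 19   # banana
```

The `point -gt 0` guard in rule 2 avoids `${line:point-1:1}` wrapping around to the last character when the index is 0, an edge case the original `prevchar` assignment also hits.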
Another way in bash, without arrays:
#!/bin/bash
string="foo bar apple banana q xy"
wordAtIndex() {
    local index=$1 string=$2 ret='' last first
    if [ "${string:index:1}" != " " ] ; then
        last="${string:index}"
        first="${string:0:index}"
        ret="${first##* }${last%% *}"
    fi
    echo "$ret"
}
for ((i=0; i < "${#string}"; ++i)); do
    printf '%s <-- "%s"\n' "${string:i:1}" "$(wordAtIndex "$i" "$string")"
done
if anyone knows of a better way to do this within bash only
Use regexes. With ^.{4} you can skip the first four letters to navigate to index 4. With [[:graph:]]* you can match the rest of the word at that index. * is greedy and will match as many graphical characters as possible.
wordAtIndex() {
    local index=$1 string=$2 left right indexFromRight
    [[ "$string" =~ ^.{$index}([[:graph:]]*) ]]
    right=${BASH_REMATCH[1]}
    ((indexFromRight=${#string}-index-1))
    [[ "$string" =~ ([[:graph:]]*).{$indexFromRight}$ ]]
    left=${BASH_REMATCH[1]}
    echo "$left${right:1}"
}
And here is a full test for your example:
string="foo bar apple banana q xy"
for ((i=0; i < "${#string}"; ++i)); do
printf '%s <-- "%s"\n' "${string:i:1}" "$(wordAtIndex "$i" "$string")"
done
This outputs the input string vertically on the left, and on each index extracts the word that index points to on the right.
f <-- "foo"
o <-- "foo"
o <-- "foo"
<-- ""
b <-- "bar"
a <-- "bar"
r <-- "bar"
<-- ""
<-- ""
<-- ""
a <-- "apple"
p <-- "apple"
p <-- "apple"
l <-- "apple"
e <-- "apple"
<-- ""
<-- ""
b <-- "banana"
a <-- "banana"
n <-- "banana"
a <-- "banana"
n <-- "banana"
a <-- "banana"
<-- ""
q <-- "q"
<-- ""
x <-- "xy"
y <-- "xy"
Related
My task is to write a .sh script that reads the user's first name, then uses a loop to count the occurrences of the letter 'a' and prints their number.
I understand that it loads the text with:
read -p "Please enter some text" text
But referring to the element ${text[0]} gets all the text, not a single character of it.
#!/bin/bash
echo "Please write"
read b
if [ "${b:${#b}-1:1}" = 'a' ] ; then
    echo "Women"
else
    echo "man"
fi
l=0
for (( i=0 ; i < ${#b} ; i++ )) ; do
    if [ "${b:$i:1}" = 'a' ] ; then
        ((l++))
    fi
done
echo "L = $l"
For counting the number of a characters in a variable, you could first erase all characters which are not an a. Example:
text=abcaagg
atext=${text//[!a]/}
The variable atext now holds only aaa. Calculate the length of that string, and you know how many a you had in your original variable:
echo ${#atext}
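As a complete sketch of that approach:

```shell
#!/bin/bash
# Count the letter 'a' by deleting every character that is not 'a',
# then taking the length of what remains.
text="banana"
atext=${text//[!a]/}   # -> "aaa"
echo "${#atext}"       # prints 3
```

Note the match is case-sensitive, so a capital A would not be counted unless you extend the bracket expression to [!aA].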
UPDATE: By request, I quote here the part of the bash man page which explains the substitution. It is stated in the section titled Parameter Expansion:
${parameter/pattern/string}
Pattern substitution. The pattern is expanded to produce a pattern just as in pathname expansion. Parameter is expanded and the longest match of pattern against its value is replaced with string. If pattern begins with /, all matches of pattern are replaced with string. Normally only the first match is replaced. If pattern begins with #, it must match at the beginning of the expanded value of parameter. If pattern begins with %, it must match at the end of the expanded value of parameter. If string is null, matches of pattern are deleted and the / following pattern may be omitted. If the nocasematch shell option is enabled, the match is performed without regard to the case of alphabetic characters.
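A few quick illustrations of the forms described above:

```shell
s="foo bar foo"
echo "${s/foo/X}"    # first match:      X bar foo
echo "${s//foo/X}"   # all matches:      X bar X
echo "${s/#foo/X}"   # match at start:   X bar foo
echo "${s/%foo/X}"   # match at end:     foo bar X
echo "${s//o/}"      # delete every 'o': f bar f
```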
I want to build a string which contains quoted groups of words.
Each group should go to the same function argument.
I tried to play with arrays.
Literally constructed arrays work, but I still hope to find a magic syntax hack for a bare string.
# literal array
LA=(a "b c")
function printArgs() { # function should print 2 lines
    while [ $# -ne 0 ] ; do echo "$1" ; shift; done
}
printArgs "${LA[@]}" # works fine
# but how to use string to split only unquoted spaces?
LA="a \"b c\""
printArgs "${LA[@]}" # doesn't work :(
LA=($LA)
printArgs "${LA[@]}" # also doesn't work :(
bash arrays have the problem that they cannot be passed through a pipeline or command substitution (echo/$()).
A dirty approach would be :
#!/bin/bash
LA=(a "b c")
function printArgs()
{ # function should print 2 lines
    while [ $# -ne 0 ]
    do
        echo "${1//_/ }" # Use parameter expansion to globally replace '_' with space.
                         # Double quote, as we don't want word splitting here.
        shift
    done
}
printArgs "${LA[@]}" # works fine
LA="a b__c" # Use a placeholder '_' for space; note the two '_' for two spaces
printArgs $LA # Don't double quote '$LA' here. We wish word splitting to happen. And it works fine :-)
Sample Output
a
b c
a
b  c
Note that the number of spaces inside grouped entities is preserved.
Sidenote
The choice of placeholder is critical here. Hopefully you can find one that won't appear in the actual string.
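If the string is fully trusted, another option is to let eval re-parse it, which honors the embedded quotes. This is only a sketch: eval will execute anything in the string, so never use it on untrusted input.

```shell
#!/bin/bash
printArgs() { while [ $# -ne 0 ]; do echo "$1"; shift; done; }

LA='a "b c"'
eval "arr=($LA)"      # re-parses the string; embedded quotes group words
printArgs "${arr[@]}" # prints: a, then b c
```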
I have this string:
E="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
Any idea how to swap the position of a letter with its neighbour if the user enters no,
continuing until the user is satisfied with the string OR the end of the string is reached?
is the position of 1st correct? Y/N
N
E=BACDEFGHIJKLMNOPQRSTUVWXYZ
*some of my code here*
are u satisfied? Y/N
N
is the position of 2nd correct? Y/N
N
E=BCADEFGHIJKLMNOPQRSTUVWXYZ
*some of my code here*
are u satisfied? Y/N
N
is the position 3rd correct? Y?N
Y
E=BCADEFGHIJKLMNOPQRSTUVWXYZ
*some of my code here*
are u satisfied? Y/N
N
is the position 4th correct? Y?N
Y
E=BCADEFGHIJKLMNOPQRSTUVWXYZ
*some of my code here*
are u satisfied? Y/N
Y
*exit prog*
Any help will be greatly appreciated. Thanks.
Edited:
I got this code from a forum and it worked perfectly, but any idea how to swap the next character after it has been done once? For example, I've done the first position and want to run it for the second character. Any idea?
dual=ETAOINSHRDLCUMWFGYPBVKJXQZ
phrase='E'
rotat=1
newphrase=$(echo $phrase | tr "${dual:0:26}" "${dual:${rotat}:26}")
echo ${newphrase}
You will have to use a loop.
#!/bin/bash
E="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
echo "$E"
for (( i = 1; i < ${#E}; i++ )); do
    echo "Is position $i correct? Y/N"
    read answer
    if [ "$answer" == "N" -o "$answer" == "n" ]
    then
        E="${E:0:$i-1}${E:$i:1}${E:$i-1:1}${E:$i+1}"
    fi
    echo "$E"
    echo "Are you satisfied? Y/N"
    read answer
    if [ "$answer" == "Y" -o "$answer" == "y" ]
    then
        break
    fi
done
The loop iterates over every character of the string. The string altering happens in the first if clause. It's nothing more than basic substring operations. ${E:n} returns the substring of E starting at position n. ${E:n:m} returns the next m characters of E starting at position n. The remaining lines handle the case where the user is satisfied and wants to exit.
With bash, you can extract a substring easily:
${string:position:length}
This syntax allows you to use variable expansions, so it is quite straightforward to swap two consecutive characters in a string:
E="${dual:0:$rotat}${dual:$((rotat+1)):1}${dual:$rotat:1}${dual:$((rotat+2))}"
Arithmetic may need to be enclosed in $((...)).
From bash man pages:
${parameter:offset}
${parameter:offset:length}
Substring Expansion. Expands to up to length characters of parameter starting at the character specified by offset. If length is omitted, expands to the substring of parameter starting at the character specified by offset. length and offset are arithmetic expressions (see ARITHMETIC EVALUATION below). length must evaluate to a number greater than or equal to zero. If offset evaluates to a number less than zero, the value is used as an offset from the end of the value of parameter. If parameter is @, the result is length positional parameters beginning at offset. If parameter is an array name indexed by @ or *, the result is the length members of the array beginning with ${parameter[offset]}. A negative offset is taken relative to one greater than the maximum index of the specified array. Note that a negative offset must be separated from the colon by at least one space to avoid being confused with the :- expansion. Substring indexing is zero-based unless the positional parameters are used, in which case the indexing starts at 1.
Examples:
pos=5
E="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
echo "${E:pos:1}" # print 6th character (F)
echo "${E:pos}"   # print from 6th character (F) to the end
What do you mean when you say "its neighbour"? Except for the first and last characters, every character in the string has two neighbours.
To exchange the "POS" character (starting from 1) and its next one (POS+1):
E="${E:0:POS-1}${E:POS:1}${E:POS-1:1}${E:POS+1}"
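For example, with POS=2 this swaps the 2nd and 3rd characters:

```shell
# Swap the characters at 1-based positions POS and POS+1.
E="ABCD"
POS=2
E="${E:0:POS-1}${E:POS:1}${E:POS-1:1}${E:POS+1}"
echo "$E"   # ACBD
```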
I want to do two things:
1) count the number of times a given word appears in a text file
2) print out the context of that word
This is the code I am currently using:
my $word_delimiter = qr{
    [^[:alnum:][:space:]]*
    (?: [[:space:]]+ | -- | , | \. | \t | ^ )
    [^[:alnum:]]*
}x;
my $word = "hello";
my $count = 0;
#
# here, a file's contents are loaded into $lines, code not shown
#
$lines =~ s/\R/ /g; # replace all line breaks with blanks (cannot just erase them, because this might connect words that should not be connected)
$lines =~ s/\s+/ /g; # replace all multiple whitespaces (incl. blanks, tabs, newlines) with single blanks
$lines = " ".$lines." "; # add a blank at beginning and end to ensure that first and last word can be found by regex pattern below
while ($lines =~ m/$word_delimiter$word$word_delimiter/g ) {
    ++$count;
    # here, I would like to print the word with some context around it (i.e. a few words before and after it)
}
Three problems:
1) Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words? Of course, I would not want to separate hyphenated words, etc. [Note: I am using UTF-8 throughout but only English and German text; and I understand what reasonably separates a word might be a matter of judgment]
2) When the file to be analyzed contains text like "goodbye hello hello goodbye", the counter is incremented only once, because the regex only matches the first occurrence of " hello ". After all, the second time it could find "hello", it is not preceded by another whitespace. Any ideas on how to catch the second occurrence, too? Should I maybe somehow reset pos()?
3) How to (reasonably efficiently) print out a few words before and after any matched word?
Thanks!
1. Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words?
Word characters are denoted by the character class \w. It also matches digits and characters from non-roman scripts.
\W represents the negated sense (non-word characters).
\b represents a word boundary and has zero-length.
Using these already available character classes should suffice.
2. Any ideas on how to catch the second occurence, too?
Use zero-length word boundaries.
while ( $lines =~ /\b$word\b/g ) {
    ++$count;
}
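The same zero-width word-boundary idea is available from the shell, for instance with grep -o -w, which prints each whole-word match on its own line. This is just a quick sanity check of the boundary behavior, not part of the Perl solution:

```shell
# Count whole-word occurrences of "hello"; -w adds word boundaries,
# -o emits one match per line so adjacent matches are both counted.
text="goodbye hello hello goodbye"
grep -o -w 'hello' <<<"$text" | wc -l   # 2 matches
```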
I am trying to write a random passphrase generator. I have a dictionary with a bunch of words and I would like to remove words whose root is already in the dictionary, so that a dictionary that looks like:
ablaze
able
abler
ablest
abloom
ably
would end up with only
ablaze
able
abloom
ably
because abler and ablest contain able which was previously used.
I would prefer to do this with grep so that I can learn more about how that works. I am capable of writing a program in c or python that will do this.
If the list is sorted so that shorter strings always precede longer strings, you might be able to get fairly good performance out of a simple Awk script.
awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' words
If the current word matches the prefix regex r (defined in a moment) and the prefix p (ditto) is in the list of seen keys, skip. Otherwise, add the current word to the prefix keys, print the current line, create a regex which matches the current word at beginning of line (this is now the prefix regex r) and also remember the prefix string in p.
If all the similar strings are always adjacent (as they would be if you sort the file lexically), you could do away with k and p entirely too, I guess.
awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
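Assuming the sample list from the question is in sorted order, a quick check shows the simplified one-liner keeps exactly the words the question asks for:

```shell
# abler and ablest share the prefix "able" and are skipped.
printf '%s\n' ablaze able abler ablest abloom ably |
    awk 'NR>1 && $1~r { next } { print; r="^" $1 }'
# prints: ablaze, able, abloom, ably
```

One caveat: the word itself is used as a regex, so words containing metacharacters (., *, etc.) would need escaping.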
This is based on the assumption that the input file is sorted. In that case, when looking up each word, all matches after the first one can be safely skipped (because they will correspond to "the same word with a different suffix").
#!/bin/bash
input=$1
while read -r word ; do
    # ignore short words
    if [ ${#word} -lt 4 ] ; then continue; fi
    # output this line
    echo "$word"
    # skip next lines that start with $word as prefix
    skip=$(grep -c -E -e "^${word}" "$input")
    for ((i=1; i<$skip; i++)) ; do read -r word ; done
done <"$input"
Call as ./filter.sh input > output
This takes somewhat less than 2 minutes on all words of 4 or more letters found in my /usr/share/dict/american-english dictionary. The algorithm is O(n²), and therefore unsuitable for large files.
However, you can speed things up a lot if you avoid using grep at all. This version takes only 4 seconds to do the job (because it does not need to scan the whole file almost once per word). Since it performs a single pass over the input, its complexity is O(n):
#!/bin/bash
input=$1
while true ; do
    # use already-read word, or fail if cannot read new
    if [ -n "$next" ] ; then word=$next; unset next;
    elif ! read -r word ; then break; fi
    # ignore short words
    if [ ${#word} -lt 4 ] ; then continue; fi
    # output this word
    echo "${word}"
    # skip words that start with $word as prefix
    while read -r next ; do
        unique=${next#$word}
        if [ ${#next} -eq ${#unique} ] ; then break; fi
    done
done <"$input"
Supposing you want to start with words that share the same first four (up to ten) letters, you could do something like this:
cp /usr/share/dict/words words
str="...."
for num in 4 5 6 7 8 9 10; do
    for word in `grep "^$str$" words`; do
        grep -v "^$word." words > words.tmp
        mv words.tmp words
    done
    str=".$str"
done
You wouldn't want to start with 1 letter, unless 'a' is not in your dictionary, etc.
Try this BASH script:
a=()
while read -r w; do
    [[ ${#a[@]} -eq 0 ]] && a+=("$w") && continue
    grep -qvf <(printf "^%s\n" "${a[@]}") <<< "$w" && a+=("$w")
done < file
printf "%s\n" "${a[@]}"
ablaze
able
abloom
ably
It seems like you want to group adverbs together. Some adverbs, including those that can also be adjectives, use er and est to form comparisons:
able, abler, ablest
fast, faster, fastest
soon, sooner, soonest
easy, easier, easiest
This procedure is known as stemming in natural language processing, and can be achieved using a stemmer or lemmatizer. There are popular implementations in Python's NLTK module, but the problem is not completely solved. The best out-of-the-box stemmer is the Snowball stemmer, but it does not stem adverbs to their root.
import nltk
initial = '''
ablaze
able
abler
ablest
abloom
ably
fast
faster
fastest
'''.splitlines()
snowball = nltk.stem.snowball.SnowballStemmer("english")
stemmed = [snowball.stem(word) for word in initial]
print set(stemmed)
output...
set(['', u'abli', u'faster', u'abl', u'fast', u'abler', u'abloom', u'ablest', u'fastest', u'ablaz'])
The other option is to use a regex stemmer, but this has its own difficulties, I'm afraid.
patterns = "er$|est$"
regex_stemmer = nltk.stem.RegexpStemmer(patterns, 4)
stemmed = [regex_stemmer.stem(word) for word in initial]
print set(stemmed)
output...
set(['', 'abloom', 'able', 'abl', 'fast', 'ably', 'ablaze'])
If you just want to weed out some of the words, this gross command will work. Note that it'll throw out some legit words like best, but it's dead simple. It assumes you have a test.txt file with one word per line
egrep -v "er$|est$" test.txt >> results.txt
egrep is the same as grep -E. -v means throw out matching lines. x|y matches if either x or y matches, and $ means end of line, so you'd be looking for words that end in er or est.
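Applied to the sample list from the question (test.txt stands in for whatever file holds one word per line):

```shell
# Drop words ending in "er" or "est"; abler and ablest are removed.
printf '%s\n' ablaze able abler ablest abloom ably > test.txt
egrep -v "er$|est$" test.txt
# prints: ablaze, able, abloom, ably
```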