How to compare and recursively modify strings in Bash

I need to write Bash code that performs the task described below.
The input: two uppercase strings of the same length (the length
itself is arbitrary). E.g.:
CYVFGDDAS --> string1 , unchangeable reference string
CRFDGVEAT --> string2 , modifiable string
I am trying to write Bash code that compares the characters at the same index recursively, starting from the first position:
-- beginning of the cycle --
if the characters are the same, skip any action and go to
the next position,
while
if the characters are not the same, the character of string1
replaces the character of string2 at that position
the new string2 is saved in a file
a substitution code is also written in the same file (I will
explain this below)
the old string2 is replaced by the new string2 so that
its changes are retained
start another cycle from the beginning
------
Repeat the cycle until the last character is processed.
So, for the example above, the code should start checking from the
first position where two C characters are placed. They match so no
action is taken and both strings are left unchanged.
Going to the second position, Y should replace R in the second string;
the modified string should be saved and written to a text file together with the substitution code YA2R (Y is the replacing character from string1, A is a constant character that must be present in all substitution codes, 2 is the positional index where the substitution occurred, and R is the replaced character of string2).
I am proficient in Python, which has a large number of modules for string manipulation, but because the code has to be added to a pre-existing Bash program I need to get this done in a Bash environment (builtin commands, awk, sed etc.; it does not matter which). It looks to me as if Bash does not have as extensive an arsenal of tools as Python, so I am first of all wondering whether this project is feasible at all.
What I have tried so far is to convert the strings into blank-separated
fields by inserting spaces between the characters, so that
awk can handle them as fields, but I did not get very
far with this.
Sorry for the lengthy explanation. Any help is greatly appreciated.

No recursion is needed, just iterate over the strings. You can use parameter expansion with a for loop:
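#!/bin/bash
s1=CYVFGDDAS   # reference string, never modified
s2=CRFDGVEAT   # string to be brought in line with s1
for ((i=0; i<${#s1}; ++i)) ; do
    if [[ ${s1:i:1} != "${s2:i:1}" ]] ; then
        # s1 up to and including position i, then the untouched rest of s2
        printf '%s\n' "${s1:0:i+1}${s2:i+1}"
        # substitution code: new char, "A", 1-based position, old char
        printf '%s\n' "${s1:i:1}A$((i+1))${s2:i:1}"
    fi
done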
${s1:i:1} means extract the substring of $s1 from position $i of length 1. If the length is omitted, it extracts as much as it can.
It just prints the strings; redirect them to a file as needed. Output:
CYFDGVEAT
YA2R
CYVDGVEAT
VA3F
CYVFGVEAT
FA4D
CYVFGDEAT
DA6V
CYVFGDDAT
DA7E
CYVFGDDAS
SA9T
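If the running copy of string2 should be kept in a variable, as the question describes, the same loop can be sketched as a function that updates the string in place and prints each intermediate string next to its substitution code. This is only an illustration: the function name substitute_steps and the file name substitutions.txt mentioned in the comment are made up for this example.

```bash
#!/bin/bash
# Hypothetical helper (name chosen for this example): walks both strings once
# and prints "new_string2 substitution_code" for every position that differs.
substitute_steps() {
    local ref=$1 cur=$2 i r c
    if (( ${#ref} != ${#cur} )); then
        echo "error: strings differ in length" >&2
        return 1
    fi
    for (( i = 0; i < ${#ref}; ++i )); do
        r=${ref:i:1}
        c=${cur:i:1}
        [[ $r == "$c" ]] && continue          # same character: nothing to do
        cur=${cur:0:i}${r}${cur:i+1}          # keep the change for later positions
        printf '%s %s\n' "$cur" "${r}A$((i + 1))${c}"
    done
}

# Prints to the terminal; append "> substitutions.txt" (an assumed file name)
# to save the result instead.
substitute_steps CYVFGDDAS CRFDGVEAT
```

For the sample strings this prints six lines, starting with "CYFDGVEAT YA2R" and ending with "CYVFGDDAS SA9T".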

Related

Bash pattern matching newline character in shell parameter expansion

I have a long variable in my bash script. I'm trying to iterate over it in chunks to do some processing. It's largely working:
while [ ${#REMAINING_PLAN} -gt 0 ] ; do
CURRENT_PLAN=${REMAINING_PLAN::65300} # Truncate to 65k and iterate
# problematic line:
CURRENT_PLAN=${CURRENT_PLAN%'\\n'*} # trim truncated string to the last newline
PROCESSED_PLAN_LENGTH=$((PROCESSED_PLAN_LENGTH+${#CURRENT_PLAN})) # evaluate length of outbound batch and store
# do some stuff not shown
REMAINING_PLAN=${REMAINING_PLAN:PROCESSED_PLAN_LENGTH}
done
I'm trying to truncate a variable to a max length, then further strip everything up to the last new line in the file, so that my next 'batch' starts its processing on a fresh line. But this statement isn't doing what I intend:
CURRENT_PLAN=${CURRENT_PLAN%'\\n'*} # does not actually trim truncated string to the last newline
What's wrong with it and how can I trim a string to the last instance of a newline?

Converting my comment to answer so that solution is easy to find for future visitors.
You may use this in bash to strip off the last line break and everything after it:
CURRENT_PLAN="${CURRENT_PLAN%$'\n'*}"
$'\n' is the ANSI-C quoting construct that bash uses to denote a line break.
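A small sketch of the difference between the two spellings, using a made-up sample string and a hypothetical helper name (trim_to_last_newline):

```bash
#!/bin/bash
# Hypothetical helper: drop the last newline and everything after it.
trim_to_last_newline() {
    printf '%s' "${1%$'\n'*}"
}

chunk=$'first line\nsecond line\nthird li'

# Removes the incomplete third line.
trim_to_last_newline "$chunk"; echo

# The single-quoted form looks for a literal backslash followed by "n",
# which this chunk does not contain, so nothing is removed.
printf '%s\n' "${chunk%'\n'*}"
```

If the string contains no newline at all, the expansion leaves it unchanged.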

inserting a number from stdout into a string from stdout

I'm working on a Linux terminal.
I have a string followed by a number as stdout and I need a command that replaces the middle of the string by the number and writes the result to stdout.
This is the string and number: librarian 16
and this is what the output should be: l16n
I have tried using echo librarian 16|sed s/[a-z]*/16/g and this gives me 9 999. The problems are that it replaces every letter separately, that it also replaces the first and last letter, and that I can't make it use the number from stdout.
I have also tried using cut -c 1-1, sed s/[^0-9]*//g and cut -c 9-9 to generate l, 16 and n respectively, but I can't find how to combine their outputs into a single line.
Lastly I have tried using text editors to copy the number and paste it into the string but I haven't made much progress since I don't know how to use editors directly from the command line.

So what you want is to capture the first letter, the last letter and the number while ignoring the middle.
In regex we use ( and ) to tell the engine what we want to capture, anything else simply gets matched, or "eaten", but not captured. So the pattern should look like this:
([a-z])[a-z]*([a-z]) ([0-9]+)
([a-z]) to capture the first letter
[a-z]* to match zero or more characters without capturing them. We choose "*" here because there might not be anything to match in the middle, as when the word has only two letters.
([a-z]) to capture the last letter.
" " (a single space) to "eat" the whitespace.
([0-9]+) to capture the number. We use + instead of * because we require a number at this position.
sed uses a different syntax for some of these constructs, so we'll use the -E flag. You could do without it, but you'd have to escape the ()+ characters, which IMO makes the pattern a little confusing.
Now, to retrieve the captured content, we have to use an engine-specific sequence of characters. sed uses \n where n is the number of the capturing group, so our final pattern should look like this:
\1\3\2
\1: First letter
\3: Number
\2: Last letter
Now we put everything together:
$ echo librarian 16|sed -E 's/([a-z])[a-z]*([a-z]) ([0-9]+)/\1\3\2/g'
l16n
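The same capture groups can also be used without sed, through bash's own =~ operator and the BASH_REMATCH array. The following is a sketch under the assumption that the input is one lowercase word, a single space and a number; the function name abbreviate is invented for the example.

```bash
#!/bin/bash
# Hypothetical helper: "librarian 16" -> "l16n", using bash's regex matching.
abbreviate() {
    local re='^([a-z])[a-z]*([a-z]) ([0-9]+)$'
    [[ $1 =~ $re ]] || return 1     # input does not have the expected shape
    # Groups: 1 = first letter, 2 = last letter, 3 = number
    printf '%s%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[3]}" "${BASH_REMATCH[2]}"
}

# Reading the text from stdin, as in the question:
echo 'librarian 16' | { IFS= read -r line; abbreviate "$line"; }
```

Keeping the regex in a variable avoids quoting problems inside [[ ... ]].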

Bash split an array, add a variable and concatenate it back together

I've been trying to figure this out, but unfortunately I can't. I am trying to create a function that finds the ';' character, puts four spaces before it, and then puts the code back together in a neat sentence. I've been cracking at this for a bit and can't figure out a couple of things. I can't get the output to display what I want it to. I've tried finding the index of the ';' character, and it seems I'm going about it the wrong way. The other mistake I seem to be making is that I'm trying to split an array in a for loop and then split the individual words in the array by letter, but I can't figure out how to do that either. If someone can give me a pointer, it would be greatly appreciated. This is in bash version 4.3.48
commentPlacer()
{
arg=($1) #argument
len=${#arg[@]} #length of the argument
comment=; #character to look for in second loop
commaIndex=(${arg[@]#;}) #the attempted index look up
commentSpace="    ;" #the variable being concatenated into the array
for(( count1=0; count1 <= ${#arg[@]}; count1++ )) #search the argument looking for comment space
do if [[ ${arg[count1]} != commentSpace ]] #if no commentSpace variable then
then for (( count2=0; count2 < ${#arg[count1]} ; count2++ )) #loop through again
do if [[ ${arg[count2]} != comment ]] #if no comment
then A=(${arg[@]:0:commaIndex})
A+=(commentSpace)
A+=(${arg[@]commaIndex:-1}) #concatenate array
echo "$A"
fi
done
fi
done
}

If I understand what you want correctly, it's basically to put 4 spaces in front of each ";" in the argument, and print the result. This is actually simple to do in bash with a string substitution:
commentPlacer() {
    echo "${1//;/    ;}"
}
The expansion here has the format ${variable//pattern/replacement}, and it gives the contents of the variable, with each occurrence of pattern replaced by replacement. Note that with only a single / before the pattern, it would replace only the first occurrence.
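To make the difference between the single-slash and double-slash forms concrete, here is a short sketch with an invented sample line (the variable names are assumptions for the example):

```bash
#!/bin/bash
line='a = 1;b = 2;c = 3'

first_only="${line/;/    ;}"    # single slash: only the first ";" is replaced
every_one="${line//;/    ;}"    # double slash: every ";" is replaced

printf '%s\n' "$first_only" "$every_one"
```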
Now, I'm not sure I understand how your script is supposed to work, but I see several things that clearly aren't doing what you expect them to do. Here's a quick summary of the problems I see:
arg=($1) #argument
This doesn't create an array of characters from the first argument. var=(...) treats the thing in ( ) as a list of words, not characters. Since $1 isn't in double-quotes, it'll be split into words based on whitespace (generally spaces, tabs, and linefeeds), and then any of those words that contain wildcards will be expanded to a list of matching filenames. I'm pretty sure this isn't at all what you want (in fact, it's almost never what you want, so variable references should almost always be double-quoted to prevent it). Creating a character array in bash isn't easy, and in general isn't something you want to do. You can access individual characters in a string variable with ${var:index:1}, where index is the character you want (counting from 0).
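A minimal sketch of both points, assuming the default IFS; print_chars is a hypothetical helper name and the sample strings are made up:

```bash
#!/bin/bash
# Unquoted expansion inside ( ) splits on whitespace into words, not characters.
text='one two  three'
words=($text)
echo "${#words[@]}"        # number of words

# Hypothetical helper: print each character of its argument on its own line,
# using ${var:index:1} instead of a character array.
print_chars() {
    local s=$1 i
    for (( i = 0; i < ${#s}; i++ )); do
        printf '%s\n' "${s:i:1}"
    done
}
print_chars 'x=1;'
```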
commaIndex=(${arg[@]#;}) #the attempted index look up
This doesn't do a lookup. The substitution ${var#pattern} gives the value of var with pattern removed from the front (if it matches). If there are multiple possible matches, it uses the shortest one. The variant ${var##pattern} uses the longest possible match. With ${array[@]#pattern}, it'll try to remove the pattern from each element -- and since it's not in double-quotes, the result of that gets word-split and wildcard-expanded as usual. I'm pretty sure this isn't at all what you want.
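For illustration, here is how the shortest and longest prefix removals behave, followed by one possible way to get an actual index out of the related ${var%%pattern} expansion. The helper name index_of_semicolon is an assumption made for this sketch, not something from the question.

```bash
#!/bin/bash
s='a;b;c'
echo "${s#*;}"     # shortest match of "*;" removed from the front: b;c
echo "${s##*;}"    # longest match removed from the front: c

# Hypothetical helper: 0-based index of the first ";" or -1 if there is none.
index_of_semicolon() {
    local s=$1
    local prefix="${s%%;*}"      # everything before the first ";"
    if [[ $prefix == "$s" ]]; then
        echo -1                  # nothing was removed, so there is no ";"
    else
        echo "${#prefix}"        # the prefix length is the index
    fi
}
index_of_semicolon 'foo;bar'     # 3
```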
if [[ ${arg[count1]} != commentSpace ]] #if no commentSpace variable then
Here (and in a number of other places), you're using a variable without $ in front; this doesn't use the variable at all, it just treats "commentSpace" as a static string. Also, in several places it's important to have double-quotes around it, e.g. to keep the spaces in $commentSpace from vanishing due to word splitting. There are some places where it's safe to leave the double-quotes off, but in general it's too hard to keep track of them, so just use double-quotes everywhere.
General suggestions: don't try to write C (or Java or whatever) programs in bash; it works too differently, and you have to think differently. Use shellcheck.net to spot common problems (like non-double-quoted variable references). Finally, you can see what bash is doing by putting set -x before a section that doesn't do what you expect; that'll make bash print each command as it executes it, showing the equivalent of what it's executing.

Make a little function using pattern substitution on stdin:
semicolon4s() { while IFS= read -r x; do echo "${x//;/    ;}"; done; }
semicolon4s <<< 'foo;bar;baz'
Output:
foo    ;bar    ;baz

bash difference between raw string and string in variable

I wrote a little script in bash, but it only worked when I stored the string as a variable, and I'd like to know why. Here's the summary:
When I use the string itself, bash treats it as a single entity
for word in "this is a sentence"; do
echo $word
done
# => this is a sentence
If I save the exact same string into a variable, bash iterates over the words
sentence="this is a sentence"
for word in $sentence; do
echo $word
done
# => this
# is
# a
# sentence
Why are these being treated differently?
Is there a simple way to iterate through the words in the string without first saving the string as a variable?

The quotes tell bash to treat a thing in quotes as a single parameter in a parameter list at the time the expression is evaluated. The quotes (unless protected with \ or ') are removed.
echo "" # prints an empty line, no quotes
echo '""' # Print ""
export X='""'
env | grep X # X contains ""
export X=""
env | grep X # X is empty
When you use an unquoted variable, bash expands it and then splits the result into words on whitespace (roughly as if you had typed the variable's contents in the variable's place). For a for loop, bash determines the list elements to iterate over by separating the loop's parameters on whitespace, while treating (as always) quote-protected items as a single parameter/list element. The expansion of your variable was not quoted, so its words are treated as separate parameters.

As comments suggested, quotes are important. A for loop will step through a list of values terminated by a semicolon, and that list is a set of strings. Unquoted strings are delimited usually by whitespace. Whitespace inside a quoted string does not separate the string from its brethren, it's simply part of the quoted string. There's some truly excellent documentation about quotes in bash at http://mywiki.wooledge.org/Quotes . Read it. Read it now. You'll find a part that says
The quotes are not actually passed along to the command. They are removed by the shell (this process is cleverly called "quote removal").
To step through the words in a sentence that's stored in a variable (if I've inferred your question correctly), you could perhaps use an array to separate the words by whitespace:
#!/bin/bash
sentence="this is a sentence"
IFS=" " read -r -a words <<< "$sentence"
for word in "${words[@]}"; do
echo "$word"
done
In bash, read -a will divide a string by $IFS and place the divided parts into elements of the array. See http://mywiki.wooledge.org/BashGuide/Arrays for more information about how bash arrays work.
If you want more details in pursuit of a specific problem, you might want to tell us what the problem is, or risk making this an XY problem.

In the assignment
sentence="this is a sentence"
there are no unquoted spaces, so everything to the right of the = is treated as a single word. (Something like sentence=this is a sentence would be parsed as a single assignment sentence=this followed by an attempt to run a program called is.) As a result, the value of sentence is a sequence of 18 characters. It is identical to
sentence=this\ is\ a\ sentence
because again, there are no unquoted spaces.
For the same reason
for word in "this is a sentence"; do
echo $word
done
has word being set to each word in the following sequence, which only contains a single word because there are no unquoted spaces.
The key difference with your other loop is that parameter expansions are subject to word-splitting after the fact. The loop
for word in $sentence; do
echo $word
done
after parameter expansion looks like
for word in this is a sentence; do
echo $word
done
so now word is set to each of the 4 words in the list following the in keyword.
It's not clear what you are actually asking at the end of your question, but the preceding is legal code. There is no requirement that a string be placed in quotes in bash; quotes do not define something as a string value, but simply escape every character that appears within the quotes. "foo" and \f\o\o are the same thing in shell.
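The effect of quoting on word-splitting can be checked directly by counting the words a command receives. This sketch assumes the default IFS; count_args is a hypothetical helper introduced only for the demonstration.

```bash
#!/bin/bash
# Hypothetical helper: report how many separate words it was given.
count_args() { echo "$#"; }

sentence="this is a sentence"

count_args "this is a sentence"   # 1: quoted literal, one word
count_args $sentence              # 4: unquoted expansion is word-split
count_args "$sentence"            # 1: quoted expansion stays one word
count_args this is a sentence     # 4: unquoted literal words, no variable needed
```

The last call also shows how to iterate over the words without a variable: for word in this is a sentence; do ... done.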

Quoting turns any string into a single unit. If you drop the quotes, the loop iterates over the individual words.

A Linux Shell Script Problem

I have a string separated by dot in Linux Shell,
example=This.is.My.String
I want to
1.Add some string before the last dot, for example, I want to add "Good.Long" before the last dot, so I get:
This.is.My.Goood.Long.String
2.Get the part after the last dot, so I will get
String
3.Turn the dot into underscore except the last dot, so I will get
This_is_My.String
If you have time, please explain a little bit, I am still learning Regular Expression.
Thanks a lot!

I don't know what you mean by 'Linux Shell' so I will assume bash. This solution will also work in zsh, etcetera:
example=This.is.My.String
before_last_dot=${example%.*}
after_last_dot=${example##*.}
echo ${before_last_dot}.Goood.Long.${after_last_dot}
This.is.My.Goood.Long.String
echo ${before_last_dot//./_}.${after_last_dot}
This_is_My.String
The interim variables before_last_dot and after_last_dot should explain my usage of the % and ## operators. I think the // is also self-explanatory, but I'd be happy to clarify if you have any questions.
This doesn't use sed (or even regular expressions), but bash's inbuilt parameter substitution. I prefer to stick to just one language per script, with as few forks as possible :-)
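As a quick reference for the operators used above, this sketch runs all four prefix and suffix removal forms on the same example string:

```bash
#!/bin/bash
example=This.is.My.String

echo "${example%.*}"    # shortest ".*" removed from the end:   This.is.My
echo "${example%%.*}"   # longest ".*" removed from the end:    This
echo "${example#*.}"    # shortest "*." removed from the front: is.My.String
echo "${example##*.}"   # longest "*." removed from the front:  String
```

Here the patterns are shell globs, so * means "any characters" and the dot is a literal dot.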

Other users have given good answers for #1 and #2. There are some disadvantages to some of the answers for #3. In one case, you have to run the substitution twice. In another, if your string has other underscores they might get clobbered. This command works in one go and only affects dots:
sed 's/\(.*\)\./\1\n./;h;s/[^\n]*\n//;x;s/\n.*//;s/\./_/g;G;s/\n//'
1. It splits the line before the last dot by inserting a newline and copies the result into hold space:
s/\(.*\)\./\1\n./;h
2. It removes everything up to and including the newline from the copy in pattern space and swaps hold space and pattern space:
s/[^\n]*\n//;x
3. It removes everything after and including the newline from the copy that's now in pattern space:
s/\n.*//
4. It changes all dots into underscores in the copy in pattern space and appends hold space onto the end of pattern space:
s/\./_/g;G
5. It removes the newline that the append operation adds:
s/\n//
Then the sed script is finished and the pattern space is output.
At the end of each numbered step (some consist of two actual steps):
Step  Pattern Space          Hold Space
1     This.is.My\n.String    This.is.My\n.String
2     This.is.My\n.String    .String
3     This.is.My             .String
4     This_is_My\n.String    .String
5     This_is_My.String      .String

Solution
1. Two versions of this, too:
Complex: sed 's/\(.*\)\([.][^.]*$\)/\1.Goood.Long\2/'
Simple: sed 's/.*\./&Goood.Long./' - thanks Dennis Williamson
2. What do you want?
Complex: sed 's/.*[.]\([^.]*\)$/\1/'
Simpler: sed 's/.*\.//' - thanks, glenn jackman.
3. sed 's/\([^.]*\)[.]\([^.]*[.]\)/\1_\2/g'
With 3, you probably need to run the substitute (in its entirety) at least twice, in general.
Explanation
Remember, in sed, the notation \(...\) is a 'capture' that can be referenced as '\1' or similar in the replacement text.
1. Capture everything up to a string starting with a dot followed by a sequence of non-dots (which you also capture); replace by what came before the last dot, the new material, and the last dot and what came after it.
2. Ignore everything up to the last dot followed by a capture of a sequence of non-dots; replace with the capture only.
3. Find and capture a sequence of non-dots, a dot (not captured), followed by a sequence of non-dots and a dot; replace the first dot with an underscore. This is done globally, but the second and subsequent matches won't touch anything already matched. Therefore, I think you need ceil(log2(N+1)) passes, where N is the number of dots to be replaced. One pass deals with 1 dot to replace; two passes deal with 2 or 3; three passes deal with 4-7, and so on.
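Rather than working out the number of passes in advance, the substitution for item 3 can simply be repeated until the string stops changing. The wrapper below is a sketch of that idea; the function name replace_inner_dots is made up, and only the sed expression comes from the answer.

```bash
#!/bin/bash
# Hypothetical wrapper: rerun the substitution until a pass changes nothing.
replace_inner_dots() {
    local prev='' cur=$1
    while [[ $cur != "$prev" ]]; do
        prev=$cur
        cur=$(sed 's/\([^.]*\)[.]\([^.]*[.]\)/\1_\2/g' <<< "$cur")
    done
    printf '%s\n' "$cur"
}

replace_inner_dots This.is.My.String
```

The loop runs one extra pass to detect that nothing changed, which costs one more sed process but keeps the logic independent of the number of dots.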

Here's a version that uses Bash's regex matching (Bash 3.2 or greater).
[[ $example =~ ^(.*)\.(.*)$ ]]
echo ${BASH_REMATCH[1]//./_}.${BASH_REMATCH[2]}
Here's a Bash version that uses IFS (Internal Field Separator).
saveIFS=$IFS
IFS=.
array=($example) # * split the string at each dot
lastword=${array[@]: -1}
unset "array[${#array[@]}-1]" # *
IFS=_
echo "${array[*]}.$lastword" # The asterisk as a subscript when inside quotes causes IFS (an underscore in this case) to be inserted between each element of the array
IFS=$saveIFS
* use declare -p array after these steps to see what the array looks like.
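The IFS approach can also be wrapped in a function, where declaring IFS as local avoids the manual save and restore. This is a sketch that assumes the argument contains at least one dot; the function name underscore_all_but_last_dot is invented for the example.

```bash
#!/bin/bash
# Hypothetical helper; assumes the argument contains at least one dot.
underscore_all_but_last_dot() {
    local IFS=.                          # split on dots; restored on return
    local -a parts
    read -r -a parts <<< "$1"
    local last=${parts[${#parts[@]}-1]}  # remember the last field
    unset "parts[${#parts[@]}-1]"        # and drop it from the array
    IFS=_                                # "${parts[*]}" joins with the first char of IFS
    printf '%s.%s\n' "${parts[*]}" "$last"
}

underscore_all_but_last_dot This.is.My.String
```

Because IFS is local to the function, the caller's field separator is left untouched.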

1.
$ echo 'This.is.my.string' | sed 's}[^\.][^\.]*$}Good Long.&}'
This.is.my.Good Long.string
before: a run of non-dot characters reaching to the end of the line, i.e. everything after the last dot. after: obvious, & is what matched the first part
2.
$ echo 'This.is.my.string' | sed 's}.*\.}}'
string
sed greedy matches, so it will extend the first closure (.*) as far as possible i.e. to the last dot.
3.
$ echo 'This.is.my.string' | tr . _ | sed 's/_\([^_]*\)$/\.\1/'
This_is_my.string
convert all dots to _, then turn the last _ to a dot.
(caveat: this will turn 'This.is.my.string_foo' to 'This_is_my_string.foo', not 'This_is_my.string_foo')

You don't need regular expressions at all (those complex things hurt my eyes!) if you use Awk and are a little creative.
1. echo $example| awk -v ins="Good.long" -F . '{OFS="."; $NF = ins"."$NF;print}'
What this does:
-v ins="Good.long" tells awk to create a variable called 'ins' with "Good.long" as content,
-F . tells awk to use the dot as a separator for your fields for input,
OFS="." tells awk to use the dot as a separator for your fields as output,
NF is the number of fields, so $NF represents the last field,
the $NF=... part replaces the last field, it appends the current last string to what you want to insert (the variable called "ins" declared earlier).
2. echo $example| awk -F . '{print $NF}'
$NF is the last field, so that's all!
3. echo $example| awk -F . '{OFS="_"; $(NF-1) = $(NF-1)"."$NF; NF=NF-1; print}'
Here we have to be creative, as Awk AFAIK doesn't allow deleting fields. Of course, we set the output field separator to underscore.
$(NF-1) = $(NF-1)"."$NF: First, we replace the second last field with the last glued to the second last, with a dot between.
Then, we fool awk to make it think the Number of fields is equal to the number of fields minus one, hence deleting the last field!
Note you can't say $NF="", because then it would display two underscores.
