Bash pattern matching newline character in shell parameter expansion


I have a long variable in my bash script. I'm trying to iterate over it in chunks to do some processing. It's largely working:
while [ ${#REMAINING_PLAN} -gt 0 ] ; do
CURRENT_PLAN=${REMAINING_PLAN::65300} # Truncate to 65k and iterate
# problematic line:
CURRENT_PLAN=${CURRENT_PLAN%'\\n'*} # trim truncated string to the last newline
PROCESSED_PLAN_LENGTH=$((PROCESSED_PLAN_LENGTH+${#CURRENT_PLAN})) # evaluate length of outbound batch and store
# do some stuff not shown
REMAINING_PLAN=${REMAINING_PLAN:PROCESSED_PLAN_LENGTH}
done
I'm trying to truncate a variable to a max length, then further strip everything up to the last new line in the file, so that my next 'batch' starts its processing on a fresh line. But this statement isn't doing what I intend:
CURRENT_PLAN=${CURRENT_PLAN%'\\n'*} # does not actually trim truncated string to the last newline
What's wrong with it and how can I trim a string to the last instance of a newline?

Converting my comment to answer so that solution is easy to find for future visitors.
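You may use this in bash to strip the last line break and everything after it:
CURRENT_PLAN="${CURRENT_PLAN%$'\n'*}"
$'\n' is ANSI-C quoting, a C-like construct that bash expands to an actual line break. Inside plain single quotes, '\\n' is just two backslashes followed by the letter n, so the original pattern never matches.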
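A minimal sketch of the difference, assuming a made-up three-line value (the variable name plan and its contents are illustrative and not taken from the question):

```bash
#!/bin/bash
# Hypothetical sample data: three lines, the last one cut off mid-line.
plan=$'line one\nline two\nline thr'

# Literal pattern: two backslashes and an "n". It matches nothing here,
# so the value is left unchanged.
unchanged=${plan%'\\n'*}

# ANSI-C quoting: $'\n' is a real newline, so the shortest suffix that
# starts at a newline (the last newline onward) is removed.
trimmed=${plan%$'\n'*}

printf '%s\n' "$trimmed"
```

Here trimmed keeps only the two complete lines, while unchanged is still identical to plan.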

Related

How to compare and recursively modify strings in Bash

I need to write a bash code performing some tasks I am going to explain.
The input: two uppercase strings of the same length, whatever
that length is. E.g.:
CYVFGDDAS --> string1 , unchangeable reference string
CRFDGVEAT --> string2 , modifiable string
I am trying to write Bash code that is able to compare the characters with same index recursively starting from the first position:
-- beginning of the cycle --
if the characters are the same skip any action and go to
the next position,
while
if the characters are not the same the character of string1
replaces the character of string2 at that position
the new string2 is saved in a file
a substitution code is also written in the same file (I will
explain this below)
the old string2 is replaced by the new string2 in such a way
its changes are retained
start another cycle from the beginning
------
Repeat the cycle until the last character is processed.
So, for the example above, the code should start checking from the
first position where two C characters are placed. They match so no
action is taken and both strings are left unchanged.
Going to the second position, Y should replace R in the second string,
the modified string should be saved and written to a text file together with the substitution code YA2R (Y is the replacing character of string1, A is a constant character that must be present in all substitution codes, 2 is the positional index where the substitution occurred, and R is the replaced character of string2).
I am proficient in Python which has a large number of modules for string manipulation but because the code should be added to a pre-existing Bash program I need to get this done in Bash environment (builtin commands, awk, sed etc, does not matter). Looks to me that Bash does not have an extended arsenal of tools like Python, so I am first of all wondering if this project is feasible or not.
However, what I tried so far is to convert the strings in blank
separated fields by inserting spaces between the characters in such
a way awk can deal better with them as fields but I did not go very
far with this.
Sorry for the lengthy explanation. Any help is greatly appreciated.

No recursion is needed, just iterate over the strings. You can use parameter expansion with a for loop:
#!/bin/bash
s1=CYVFGDDAS
s2=CRFDGVEAT
for ((i=0; i<${#s1} ; ++i)) ; do
if [[ ${s1:i:1} != ${s2:i:1} ]] ; then
printf '%s\n' "${s1:0:i+1}${s2:i+1}"
printf '%s\n' "${s1:i:1}A$((i+1))${s2:i:1}"
fi
done
${s1:i:1} means extract the substring of $s1 from position $i of length 1. If the length is omitted, it extracts as much as it can.
The script just prints the strings; redirect them to a file as needed. Output:
CYFDGVEAT
YA2R
CYVDGVEAT
VA3F
CYVFGVEAT
FA4D
CYVFGDEAT
DA6V
CYVFGDDAT
DA7E
CYVFGDDAS
SA9T
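A variation on the loop above, sketched as a function that actually updates the second string at each step instead of slicing the reference string. The function name mutate_steps and its argument order are assumptions made for illustration, not part of the original answer.

```bash
#!/bin/bash
# mutate_steps REF STR (hypothetical helper): for every position where the
# two strings differ, print the updated string and its substitution code.
mutate_steps() {
    local ref=$1 cur=$2 i old
    for ((i = 0; i < ${#ref}; ++i)); do
        if [[ ${ref:i:1} != "${cur:i:1}" ]]; then
            old=${cur:i:1}
            # Rebuild cur with the reference character at position i.
            cur=${cur:0:i}${ref:i:1}${cur:i+1}
            printf '%s\n%s\n' "$cur" "${ref:i:1}A$((i + 1))$old"
        fi
    done
}

mutate_steps CYVFGDDAS CRFDGVEAT
```

Appending a redirection to the call, such as > some_file, would collect the lines in a file instead of printing them.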

inserting a number from stdout into a string from stdout

I'm working on a Linux terminal.
I have a string followed by a number as stdout and I need a command that replaces the middle of the string by the number and writes the result to stdout.
This is the string and number: librarian 16
and this is what the output should be: l16n
I have tried using echo librarian 16|sed s/[a-z]*/16/g and this gives me 9 999. The problems are that it replaces every letter separately, that it also replaces the first and last letter, and that I can't make it use the number from stdout.
I have also tried using cut -c 1-1, sed s/[^0-9]*//g and cut -c 9-9 to generate l, 16 and n respectively, but I can't find how to combine their outputs into a single line.
Lastly I have tried using text editors to copy the number and paste it into the string but I haven't made much progress since I don't know how to use editors directly from the command line.

So what you want is to capture the first letter, the last letter and the number while ignoring the middle.
In regex we use ( and ) to tell the engine what we want to capture, anything else simply gets matched, or "eaten", but not captured. So the pattern should look like this:
([a-z])[a-z]*([a-z]) ([0-9]+)
([a-z]) to capture the first letter
[a-z]* to match zero or more characters but not capture them. We choose "*" here because there might not be anything to match in the middle, as when the word has only two letters.
([a-z]) to capture the last letter.
A single space to "eat" the whitespace.
([0-9]+) to capture the number. We use + instead of * because we require a number at this position.
sed uses a different syntax for some of these constructs by default, so we'll use the -E flag (extended regular expressions). You could do without it, but you'd have to escape the ()+ characters, which IMO makes the pattern a little bit confusing.
Now, to retrieve the captured content, we have to use an engine-specific sequence of characters. sed uses \n where n is the number of the capturing group, so our final pattern should look like this:
\1\3\2
\1: First letter
\3: Number
\2: Last letter
Now we put everything together:
$ echo librarian 16|sed -E 's/([a-z])[a-z]*([a-z]) ([0-9]+)/\1\3\2/g'
l16n
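The same substitution can be wrapped in a small function for reuse. This is only a sketch: the function name abbreviate is made up here, and the ^ and $ anchors are an added assumption that each input line holds exactly one lowercase word, one space and one number.

```bash
#!/bin/bash
# abbreviate (hypothetical helper): read "word number" lines on stdin and
# print first letter + number + last letter for each line.
abbreviate() {
    sed -E 's/^([a-z])[a-z]*([a-z]) ([0-9]+)$/\1\3\2/'
}

echo 'librarian 16' | abbreviate
```

Lines that do not fit the assumed shape pass through sed unchanged, which makes malformed input easy to spot.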

Why does a part of this variable get replaced when combining it with a string?

I have the following Bash script which loops through the lines of a file:
INFO_FILE=playlist-info-test.txt
line_count=$(wc -l $INFO_FILE | awk '{print $1}')
for ((i=1; i<=$line_count; i++))
do
current_line=$(sed "${i}q;d" $INFO_FILE)
CURRENT_PLAYLIST_ORIG="$current_line"
input_file="$CURRENT_PLAYLIST_ORIG.mp3"
echo $input_file
done
This is a sample of the playlist-info-test.txt file:
Playlist 1
Playlist2
Playlist 3
The output of the script should be as follows:
Playlist 1.mp3
Playlist2.mp3
Playlist 3.mp3
However, I am getting the following output:
.mp3list 1
.mp3list2
.mp3list 3
I have spent a few hours on this and can't understand why the ".mp3" part is being moved to the front of the string. I initially thought it was because of the space in the lines of the input file, but removing the space doesn't make a difference. I also tried using a while loop with read line and the input file redirected into it, but that does not make any difference either.

I copied the playlist-info-test.txt contents and the script, and get the output you expected. Most likely there are non-printable characters in your playlist-info-test.txt or script which are messing up the processing. Check the binary contents of both files using for example xxd -g 1 and look for non-newline (0a) non-printing characters.

Did the file come from Windows? DOS and Windows end their lines with carriage return (hex 0d, sometimes represented as \r) followed by linefeed (hex 0a, sometimes represented as \n). Unix just uses linefeed, and so tends to treat the carriage return as part of the content of the line. In your case, it winds up at the end of the current_line variable, so input_file winds up something like "Playlist 1\r.mp3". When you print this to the terminal, the carriage return makes it go back to the beginning of the line (that's what carriage return means), so it prints as:
Playlist 1
.mp3
...with the ".mp3" printed over the "Play" part, rather than on the next line like I have it above.
Solution: either fix the file (there's a fairly standard dos2unix program that does precisely this), or change your script to strip carriage returns as it reads the file. Actually, I'd recommend a rewrite anyway, since your current use of sed to pick out lines is rather weird and inefficient. In a shell script, the standard way to read through a file line-by-line is to use a loop like while read -r current_line; do [commands here]; done <"$INFO_FILE". There's a possible problem that if any commands inside the loop read from standard input, they'll wind up inhaling part of that file; you can fix that by passing the file over file descriptor 3 rather than standard input. With that fix and a trick to trim carriage returns, here's what it looks like:
INFO_FILE=playlist-info-test.txt
while IFS=$' \t\n\r' read -r current_line <&3; do
CURRENT_PLAYLIST_ORIG="$current_line"
input_file="$CURRENT_PLAYLIST_ORIG.mp3"
echo "$input_file"
done 3<"$INFO_FILE"
(The carriage return trim is done by read -- it always auto-trims leading and trailing whitespace, and setting IFS to $' \t\n\r' tells it to treat spaces, tabs, linefeeds, and carriage returns as whitespace. And since that assignment is a prefix to the read command, it applies only to that one command and you don't have to set IFS back to normal afterward.)
A couple of other recommendations while I'm here: double-quote all variable references (as I did with echo "$input_file" above), and avoid all-caps variable names (there are a bunch with special meanings, and if you accidentally use one of those it can have weird effects). Oh, and try passing your scripts to shellcheck.net -- it's good at spotting common mistakes.
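An alternative to the IFS trick, sketched under the assumption that only a single trailing carriage return needs to go and that leading or trailing spaces should be preserved. The function name strip_cr and the sample input are illustrative only.

```bash
#!/bin/bash
# strip_cr (hypothetical helper): copy stdin to stdout line by line,
# removing one trailing carriage return from each line if present.
strip_cr() {
    local line
    # "|| [[ -n $line ]]" also handles a final line without a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "${line%$'\r'}"
    done
}

# Simulated DOS-style input, not the asker's real file.
printf 'Playlist 1\r\nPlaylist2\r\n' | strip_cr | while IFS= read -r name; do
    echo "$name.mp3"
done
```

Because ${line%$'\r'} removes at most one carriage return from the end, any other whitespace in the line is left exactly as it was.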

bash difference between raw string and string in variable

I wrote a little script in bash, but it only worked when I stored the string as a variable, and I'd like to know why. Here's the summary:
When I use the string itself, bash treats it as a single entity
for word in "this is a sentence"; do
echo $word
done
# => this is a sentence
If I save the exact same string into a variable, bash iterates over the words
sentence="this is a sentence"
for word in $sentence; do
echo $word
done
# => this
# is
# a
# sentence
Why are these being treated differently?
Is there a simple way to iterate through the words in the string without first saving the string as a variable?

The quotes tell bash to treat a thing in quotes as a single parameter in a parameter list at the time the expression is evaluated. The quotes (unless protected with \ or ') are removed.
echo "" # prints an empty line, no quotes
echo '""' # Print ""
export X='""'
env | grep X # X contains ""
export X=""
env | grep X # X is empty
When you use an unquoted variable, bash expands it and then splits the result into words on whitespace. For a for-loop, bash determines the list elements to iterate over by separating the for-loop's parameters by whitespace, but treating (as always) quote-protected items as a single parameter/list element. Your variable's value contained no quotes and the expansion itself was unquoted, so the words are treated as separate parameters.

As comments suggested, quotes are important. A for loop will step through a list of values terminated by a semicolon, and that list is a set of strings. Unquoted strings are delimited usually by whitespace. Whitespace inside a quoted string does not separate the string from its brethren, it's simply part of the quoted string. There's some truly excellent documentation about quotes in bash at http://mywiki.wooledge.org/Quotes . Read it. Read it now. You'll find a part that says
The quotes are not actually passed along to the command. They are removed by the shell (this process is cleverly called "quote removal").
To step through the words in a sentence that's stored in a variable (if I've inferred your question correctly), you could perhaps use an array to separate the words by whitespace:
#!/bin/bash
sentence="this is a sentence"
IFS=" " read -r -a words <<< "$sentence"
for word in "${words[@]}"; do
echo "$word"
done
In bash, read -a will divide a string by $IFS and place the divided parts into elements of the array. See http://mywiki.wooledge.org/BashGuide/Arrays for more information about how bash arrays work.
If you want more details in pursuit of a specific problem, you might want to tell us what the problem is, or risk making this an XY problem.

In the assignment
sentence="this is a sentence"
there are no unquoted spaces, so everything to the right of the = is treated as a single word. (Something like sentence=this is a sentence would be parsed as a single assignment sentence=this followed by an attempt to run a program called is.) As a result, the value of sentence is a sequence of 18 characters. It is identical to
sentence=this\ is\ a\ sentence
because again, there are no unquoted spaces.
For the same reason
for word in "this is a sentence"; do
echo $word
done
has word being set to each word in the following sequence, which only contains a single word because there are no unquoted spaces.
The key difference with your other loop is that parameter expansions are subject to word-splitting after the fact. The loop
for word in $sentence; do
echo $word
done
after parameter expansion looks like
for word in this is a sentence; do
echo $word
done
so now word is set to each of the 4 words in the list following the in keyword.
It's not clear what you are actually asking at the end of your question, but the preceding is legal code. There is no requirement that a string be placed in quotes in bash; quotes do not define something as a string value, but simply escape every character that appears within the quotes. "foo" and \f\o\o are the same thing in shell.
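The effect of quoting on word splitting can be checked by counting arguments. The helper count_args below is a made-up name used only for this sketch.

```bash
#!/bin/bash
# count_args (hypothetical helper): print how many arguments it received.
count_args() { echo "$#"; }

sentence="this is a sentence"

count_args "this is a sentence"   # quoted literal: one word
count_args "$sentence"            # quoted expansion: not split, one word
count_args $sentence              # unquoted expansion: split into four words
count_args this is a sentence     # unquoted literal: four words
```

The last call also shows that a for loop can iterate over the words without a variable, simply by writing them unquoted after in.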

Quoting turns any string into a single unit. If you drop the quotes (for word in this is a sentence; do ...), the loop iterates over the individual words.

Split bash string by newline characters

I found this.
And I am trying this:
x='some
thing'
y=(${x//\n/})
And I had no luck, I thought it could work with double backslash:
y=(${x//\\n/})
But it did not.
To test I am not getting what I want I am doing:
echo ${y[1]}
Getting:
some
thing
Which I want to be:
some
I want y to be an array [some, thing]. How can I do this?

Another way:
x=$'Some\nstring'
readarray -t y <<<"$x"
Or, if you don't have bash 4, the bash 3.2 equivalent:
IFS=$'\n' read -rd '' -a y <<<"$x"
You can also do it the way you were initially trying to use:
y=(${x//$'\n'/ })
This, however, will not work correctly if your string already contains spaces, such as $'line 1\nline 2'. To handle that, restrict the word separator to the newline before splitting. Once the separator is a newline, the newlines must no longer be converted to spaces (otherwise there is nothing left to split on), so the expansion simplifies to:
IFS=$'\n' y=($x)
This approach works unless $x contains a globbing pattern (such as "*") that matches file names, in which case the pattern is replaced by the matched names. Note also that IFS stays set to a newline after this line. The read/readarray methods avoid both problems, although readarray requires bash 4 or newer.
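A small check of the readarray variant on input that contains both spaces and a glob character. The sample value is invented for this sketch, and bash 4 or newer is assumed.

```bash
#!/bin/bash
# Hypothetical sample: three lines, with spaces and a "*" that an
# unquoted expansion could expand to file names.
x=$'line 1\n* star\nline 3'

# readarray splits on newlines only and performs no globbing;
# -t drops the trailing newline from each element.
readarray -t y <<<"$x"

echo "${#y[@]}"
printf '[%s]\n' "${y[@]}"
```

Each element keeps its internal spaces, and the "*" stays a literal character.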

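There is another way if all you want is the text up to the first line feed:
x='some
thing'
y=${x%%$'\n'*}
After that y will contain some and nothing else (no line feed).
What is happening here?
We perform a parameter expansion substring removal (${PARAMETER%%PATTERN}) that removes the longest suffix matching the pattern: the first line feed, written with ANSI-C quoting ($'\n'), followed by everything after it (*). With a single % the shortest match is removed instead, which cuts at the last line feed; the two forms differ only when the value has more than two lines.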
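A short sketch of how the two suffix-removal operators behave on a value with more than two lines (the sample value and variable names are illustrative):

```bash
#!/bin/bash
# Hypothetical three-line value.
x=$'one\ntwo\nthree'

# %% removes the longest matching suffix: everything from the first newline.
first_line=${x%%$'\n'*}

# % removes the shortest matching suffix: everything from the last newline.
all_but_last=${x%$'\n'*}

printf 'first line: %s\n' "$first_line"
printf 'without last line:\n%s\n' "$all_but_last"
```

With a two-line value such as the one in the question, both operators give the same result, which is why the difference is easy to miss.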
