How to split words in bash - linux

Good evening, People
Currently I have an array called inputArray which stores a 7-line input file, one line per element. One of the words is 70000($s0); how do I split it so that 70000 and ($s0) are separate?
I looked at an answer that is already on this website, but I couldn't understand it. The answer was:
s='1000($s3)'
IFS='()' read a b <<< "$s"
echo -e "a=<$a>\nb=<$b>"
giving the output a=<1000> b=<$s3>

Let me give this a shot.
In certain circumstances, the shell will perform "word splitting", where a string of text is broken up into words. The word boundaries are defined by the IFS variable. The default value of IFS is: space, tab, newline. When a string is to be split into words, any sequence of characters from this set acts as a delimiter and is removed to extract the words.
In your example, the set of characters that delimit words is ( and ). So the words in that string that are bounded by the IFS set of characters are 1000 and $s3.
What is <<< "$s"? This is a here-string. It's used to send a string to some command's standard input. It's like doing
echo "$s" | read a b
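except that form doesn't work as expected in bash: each part of a pipeline runs in a subshell, so the variables set by read are gone once the pipeline finishes. read a b <<< "$s" works well.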
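The difference between the two forms is easy to check. This is a minimal sketch, assuming bash with default settings (in particular, the lastpipe option is off); the variables pipe_result and here_result are only there for illustration:

```bash
#!/usr/bin/env bash
s='1000($s3)'

# Pipe form: read runs in a subshell, so a and b are lost afterwards.
a='' b=''
echo "$s" | IFS='()' read a b
pipe_result="a=<$a> b=<$b>"
echo "pipe:        $pipe_result"

# Here-string form: read runs in the current shell, so a and b survive.
IFS='()' read a b <<< "$s"
here_result="a=<$a> b=<$b>"
echo "here-string: $here_result"
```

With the pipe, both variables stay empty; with the here-string they hold 1000 and $s3.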
Now, what are the circumstances where word splitting occurs? One is when a variable is unquoted. A demo:
IFS='()'
echo "$s" | wc # 1 line, 1 word and 10 characters
echo $s | wc # 1 line, 2 words and 9 characters
The read command also splits a string into words, in order to assign words to the named variables. The variable a gets the first word, and b gets all the rest.
The command, broken down, is:
IFS='()' read a b <<< "$s"
#^^^^^^^ 1
#                 ^^^^^^^^ 2
#        ^^^^^^^^ 3
1. only for the duration of the read command, assign the variable IFS the value ()
2. send the string "$s" to read's stdin
3. from stdin, use $IFS to split the input into words: assign the first word to variable a and the rest of the string to variable b. Trailing characters from $IFS at the end of the string are discarded.
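Applied to the data from the question, this could look like the sketch below. Only the array name inputArray comes from the question; the array contents and the names offset, reg and results are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical contents; the real array would hold the 7 lines of the input file.
inputArray=('70000($s0)' '1000($s3)')

results=()
for word in "${inputArray[@]}"; do
    # IFS is changed only for this read; -r keeps backslashes literal.
    IFS='()' read -r offset reg <<< "$word"
    results+=("offset=<$offset> reg=<$reg>")
done
printf '%s\n' "${results[@]}"
```

The parentheses themselves are consumed as delimiters. If they are needed in the second part, "($reg)" puts them back.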
Documentation:
Word splitting
Here strings
Simple command execution, describing why this assignment of IFS is only in effect for the duration of the read command.
read command
Hope that helps.

Related

Way to replace one variable with another in a string

I need to replace one variable with another variable in multiple strings.
For example:
string1="One,two"
string2="three.four"
string3="five:six"
y=";"
for str in string1 string2 string3; do
x="$(echo "$str" | sed 's/[a-zA-Z]//g')" # extracting a character between letters
sed 's/$x/$y/'$str # I tried this, but it does not work at all.
echo "$str"
done
Expecting output:
One;two
three;four
five;six
In my output, nothing changes:
One,two
three.four
five:six
You can use bash's substitution operator instead of sed. And simply replace anything that isn't a letter with $y.
#!/bin/bash
string1="One,two"
string2="three.four"
string3="five:six"
y=";"
for str in "$string1" "$string2" "$string3"; do
x=${str//[^a-zA-Z]/$y} # the pattern is a glob, not a regex, so there is no "+" quantifier
echo "$x"
done
Output is:
One;two
three;four
five;six
Note that your general approach wouldn't work if the input string has multiple delimiters, e.g. One,two,three. When you remove all the letters you get ",,", but that doesn't appear anywhere in the string.
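A small sketch of that failure case, using only bash parameter expansion (the variable names found and fixed are made up for illustration):

```bash
#!/usr/bin/env bash
str="One,two,three"
y=";"

x=${str//[a-zA-Z]/}          # strip every letter, leaving ",,"
if [[ $str == *"$x"* ]]; then found=yes; else found=no; fi
echo "x=<$x> found in str: $found"

fixed=${str//[^a-zA-Z]/$y}   # replace each non-letter on its own
echo "$fixed"
```

Because ",," never occurs as a unit in the string, a search for it finds nothing, while replacing each non-letter individually handles any number of delimiters.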
Addressing issues with OP's current code:
referencing variables requires a leading $, preferably a pair of {}, and (usually) double quotes (eg, to ensure embedded spaces are considered as part of the variable's value)
sed can take as input a) a stream of text on stdin, b) a file, c) process substitution or d) a here-document/here-string
when building a sed script that includes variable references the sed script must be wrapped in double quotes (not single quotes)
Pulling all of this into OP's current code we get:
string1="One,two"
string2="three.four"
string3="five:six"
y=";"
for str in "${string1}" "${string2}" "${string3}"; do # proper references of the 3x "stringX" variables
x="$(echo "$str" | sed 's/[a-zA-Z]//g')"
sed "s/$x/$y/" <<< "${str}" # feeding "str" as here-string to sed; allowing variables "x/y" to be expanded in the sed script
echo "$str"
done
This generates:
One;two # generated by the 2nd sed call
One,two # generated by the echo
;hree.four # generated by the 2nd sed call
three.four # generated by the echo
five;six # generated by the 2nd sed call
five:six # generated by the echo
OK, so we're now getting some output but there are obviously some issues:
the results of the 2nd sed call are being sent to stdout/terminal as opposed to being captured in a variable (presumably the str variable - per the follow-on echo ???)
for string2 we find that x=. which when plugged into the 2nd sed call becomes sed "s/./;/"; from here the . matches the first character it finds which in this case is the 1st t in string2, so the output becomes ;hree.four (and the . is not replaced)
dynamically building sed scripts without knowing what's in x (and y) becomes tricky without some additional coding; instead it's typically easier to use parameter substitution to perform the replacements for us
in this particular case we can replace both sed calls with a single parameter substitution (which also eliminates the expensive overhead of two subprocesses for the $(echo ... | sed ...) call)
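If the delimiter held in x really has to be matched as-is, one option is to quote it inside the parameter substitution, which makes bash treat it as a literal string instead of a pattern. A minimal sketch (the variable name new is made up for illustration):

```bash
#!/usr/bin/env bash
str="three.four"
x="."
y=";"

# Quoting "$x" inside the expansion makes it match literally,
# so "." is a plain dot here and not "any character".
new=${str/"$x"/$y}
echo "$new"
```

This prints three;four, whereas the sed script s/./;/ replaced the first character of the string.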
Making a few changes to OP's current code we can try:
string1="One,two"
string2="three.four"
string3="five:six"
y=";"
for str in "${string1}" "${string2}" "${string3}"; do
x="${str//[^a-zA-Z]/${y}}" # parameter substitution; replace everything *but* a letter with the contents of variable "y"
echo "${str} => ${x}" # display old and new strings
done
This generates:
One,two => One;two
three.four => three;four
five:six => five;six

How to return only integers from a variable in Shell Script and discard letters and leading zeros?

In my shell script there is a parameter that comes from certain systems and it gives an answer similar to this one: PAR0000008.
And I need to send only the last number of this parameter to another variable, ie VAR=8.
I used the command VAR=$( echo ${PAR} | cut -c 10 ) and it worked perfectly.
The problem is when the PAR parameter comes back with a two-digit number, like PAR0000012. I need to discard the leading zeros and send only the number 12 to the variable, but I don't know how to write the logic in the shell to discard all the characters to the left of the number.
Edit Using grep To Handle 0 As Part Of Final Number
Since you are using POSIX shell, making use of a utility like sed or grep (or cut) makes sense. grep is quite a bit more flexible in parsing the string allowing a REGEX match to handle the job. Say your variable v=PAR0312012 and you want the result r=312012. You can use a command substitution (e.g. $(...)) to parse the value assigning the result to r, e.g.
v=PAR0312012
r=$(echo "$v" | grep -Eo '[1-9].*$')
echo "$r"
The grep expression is:
-Eo - use Extended REGEX and only return the matching portion of the string (-o is not specified by POSIX, but GNU and BSD grep both support it),
[1-9].*$ - from the first character in [1-9] return the remainder of the string.
This will work for PAR0000012 or PAR0312012 (with result 312012).
Result
For PAR0312012
312012
Another Solution Using expr
If your variable can have zeros as part of the final number portion, then you must find the index where the first [1-9] character occurs, and then assign the substring beginning at that index to your result variable.
The expr utility provides a set of string parsing tools that can do this. Note that index and substr are extensions found in GNU expr; POSIX does not specify them, so check that your expr supports them. The needed commands are:
expr index string charlist
and
expr substr string pos length
Where pos is the 1-based position at which the substring starts and length is the number of characters to extract. length just has to be long enough to encompass the entire substring, so you can just use the total length of your string, e.g.
v=PAR0312012
ndx=$(expr index "$v" "123456789")
r=$(expr substr "$v" "$ndx" 10)
echo $r
Result
312012
This will handle 0 anywhere after the first [1-9].
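Where expr index and expr substr are not available, the : (match) operator of expr, which POSIX does specify, can do the same job in one call. This is a sketch of an alternative, not part of the approach above:

```bash
#!/usr/bin/env bash
v=PAR0312012
# STRING : REGEX is anchored at the start of the string and prints
# whatever the \( \) group matched: skip everything that is not 1-9,
# then capture the rest.
r=$(expr "$v" : '[^1-9]*\(.*\)')
echo "$r"
```

For PAR0312012 this prints 312012, and for PAR0000012 it prints 12.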
(note: the old expr ... isn't the fastest way of handling this, but if you are only concerned with a few tens of thousands of values, it will work fine. A billion numbers and another method will likely be needed)
This can be done easily using Parameter Expansion.
var='PAR0000008'
echo "${var##*0}"
# prints 8
echo "${var##*[^1-9]}"
# prints 8
var="${var##*0}"
echo "$var"
# prints 8
var='PAR0000012'
echo "${var##*0}"
# prints 12
echo "${var##*[^1-9]}"
# prints 12
var="${var##*[^1-9]}"
echo "$var"
# prints 12
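One caveat: both patterns cut at the last zero, so a number that itself contains a zero gets truncated. The sketch below shows the problem and a two-step variant that first isolates the digits and then strips only the leading zeros; the variable names short, digits and num are made up for illustration.

```bash
#!/usr/bin/env bash
var='PAR0000102'

short=${var##*0}                    # cuts at the last 0, leaving only 2
digits=${var##*[!0-9]}              # drop everything up to the last non-digit: 0000102
num=${digits#"${digits%%[1-9]*}"}   # drop only the leading run of zeros: 102
echo "short=$short digits=$digits num=$num"
```

The inner expansion ${digits%%[1-9]*} yields the leading zeros (everything before the first non-zero digit), and the outer one removes exactly that prefix.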

Deleting characters from permutations and character combinations

To delete particular characters from a combination list.
printf "%s\n" {a..c}{a..d} | sed 's/^cc//' | tr -s '\n'
I used the code above to delete a particular entry from the list of combinations. Is there a way I can do it without sed, awk, grep or bc? Can I get it done with a single line of code in the script?
If you have stored your values in an array, e.g.:
arr=({a..c}{a..d})
Then you may filter your array elements with a string substitution:
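printf -- '%s\n' "${arr[@]/#cc/}"
The syntax ${arr[@]/#cc/} tells bash to go through all elements of the array arr and substitute a leading cc with nothing. The # character anchors the pattern at the beginning of the string, similar to ^ in sed, thus #cc means "every string beginning with cc" (% would anchor at the end instead). Note that the matching element becomes an empty string rather than disappearing, so an empty line is still printed for it.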
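If the entry should disappear entirely instead of being blanked, a plain loop over the array works with shell builtins only. This is a sketch, and the array name filtered is made up for illustration:

```bash
#!/usr/bin/env bash
arr=({a..c}{a..d})

filtered=()
for item in "${arr[@]}"; do
    # Keep every element except the one to delete.
    [[ $item == cc ]] || filtered+=("$item")
done
printf '%s\n' "${filtered[@]}"
```

This is more than one line, but it needs no external command and no tr -s to squeeze out the empty line afterwards.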

How do I stop `read` with `IFS` from merging together whitespace characters? [duplicate]

Take this piece of code that reads in data separated by |
DATA1="Andreas|Sweden|27"
DATA2="JohnDoe||30" # <---- UNKNOWN COUNTRY
while IFS="|" read -r NAME COUNTRY AGE; do
echo "NAME: $NAME";
echo "COUNTRY: $COUNTRY";
echo "AGE: $AGE";
done<<<"$DATA2"
OUTPUT:
NAME: JohnDoe
COUNTRY:
AGE: 30
It should work identically to this piece of code, where we are doing the exact same thing, just using \t as a separator instead of |
DATA1=$'Andreas\tSweden\t27'
DATA2=$'JohnDoe\t\t30' # <---- THERE ARE TWO TABS HERE
while IFS=$'\t' read -r NAME COUNTRY AGE; do
echo "NAME: $NAME";
echo "COUNTRY: $COUNTRY";
echo "AGE: $AGE";
done<<<"$DATA2"
But it doesn't.
OUTPUT:
NAME: JohnDoe
COUNTRY: 30
AGE:
Bash, or read, or IFS, or some other part of the code is merging the whitespace together when it isn't supposed to. Why is this happening, and how can I fix it?
bash is behaving exactly as it should. From the bash documentation:
The shell treats each character of IFS as a delimiter, and splits the results of the other expansions into words on these characters. If IFS is unset, or its value is exactly <space><tab><newline>, the default, then sequences of <space>, <tab>, and <newline> at the beginning and end of the results of the previous expansions are ignored, and any sequence of IFS characters not at the beginning or end serves to delimit words. If IFS has a value other than the default, then sequences of the whitespace characters space and tab are ignored at the beginning and end of the word, as long as the whitespace character is in the value of IFS (an IFS whitespace character). Any character in IFS that is not IFS whitespace, along with any adjacent IFS whitespace characters, delimits a field. A sequence of IFS whitespace characters is also treated as a delimiter.
To overcome this "feature", you could do something like the following:
#!/bin/bash
DATA1=$'Andreas\tSweden\t27'
DATA2=$'JohnDoe\t\t30' # <---- THERE ARE TWO TABS HERE
echo "$DATA2" | sed 's/\t/;/g' |
while IFS=';' read -r NAME COUNTRY AGE; do
echo "NAME: $NAME"
echo "COUNTRY: $COUNTRY"
echo "AGE: $AGE"
done
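The same idea can be sketched without sed and without a pipeline, by swapping the tabs for a non-whitespace separator with parameter expansion. The choice of the ASCII unit separator (0x1f) is an assumption: it only works if that character never occurs in the data. The variable names tab, sep and line are made up for illustration.

```bash
#!/usr/bin/env bash
DATA2=$'JohnDoe\t\t30'     # two tabs, i.e. an empty COUNTRY field

tab=$'\t'
sep=$'\x1f'                # assumed not to occur in the data
line=${DATA2//$tab/$sep}   # swap every tab for the non-whitespace separator

# A non-whitespace IFS character delimits fields one by one,
# so the empty field between the two separators is preserved.
IFS=$sep read -r NAME COUNTRY AGE <<< "$line"
echo "NAME: $NAME"
echo "COUNTRY: $COUNTRY"
echo "AGE: $AGE"
```

Since read is not on the receiving end of a pipe here, NAME, COUNTRY and AGE remain set in the current shell afterwards.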

bash 4: Generic access to substring (n) of string by arbitrary delimiter?

Let's assume I have the following string: x="number 1;number 2;number 3".
Access to the first substring is successful via ${x%%";"*}, access to the last substring is via ${x##*";"}:
$ x="number 1;number 2;number 3"
$ echo "front : ${x%%";"*}" #front-most-part
front : number 1
$ echo "back : ${x##*";"}" #back-most-part
back : number 3
$
How do I access the middle part: (eg. number 2)?
Is there a better way to do this if I have (many...) more parts than just three?
In other words: Is there a generic way of accessing substring No. n of string yyy, delimited by string xxx, where xxx is an arbitrary string/delimiter?
I have read How do I split a string on a delimiter in Bash?, but I specifically do not want to iterate over the string but rather directly access a given substring.
This specifically does not ask for a split into arrays, but into sub-strings.
With a fixed index:
x="number 1;number 2;number 3"
# Split input into fields by ';' and read the 2nd field into $f2
# Note the need for the *2nd* `unused`, otherwise f2 would
# receive the 2nd field *plus the remainder of the line*.
IFS=';' read -r unused f2 unused <<<"$x"
echo "$f2"
Generically, using an array:
x="number 1;number 2;number 3"
# Split input into fields by ';' and read all resulting fields
# into an *array* (-a).
IFS=';' read -r -a fields <<<"$x"
# Access the desired field.
ndx=1
echo "${fields[ndx]}"
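To get "direct access" at the call site, the array technique can be wrapped in a small function. nth_field is a hypothetical helper name, not a bash builtin, and the sketch assumes single-line input and a single-character separator:

```bash
#!/usr/bin/env bash
# nth_field SEP N STRING: print field N (0-based) of STRING split on the
# single character SEP. Hypothetical helper, not a bash builtin.
nth_field() {
    local sep=$1 n=$2 str=$3
    local -a fields
    IFS=$sep read -r -a fields <<< "$str"
    printf '%s\n' "${fields[n]}"
}

x="number 1;number 2;number 3"
nth_field ';' 1 "$x"
```

The call prints number 2; changing the index argument selects any other field.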
Constraints:
Using IFS, the special variable specifying the Internal Field Separator characters, invariably means:
Only single, literal characters can act as field separators.
However, you can specify multiple characters, in which case any of them is treated as a separator.
The default separator characters are $' \t\n' - i.e., space, tab, and newline, and runs of them (multiple contiguous instances) are always considered a single separator; e.g., 'a   b' has 2 fields - the multiple spaces count as a single separator.
By contrast, with any other character, characters in a run are considered separately, and thus separate empty fields; e.g., 'a;;b' has 3 fields - each ; is its own separator, so there's an empty field between ;;.
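Since IFS cannot express a multi-character delimiter, one way around that constraint is to peel fields off with parameter expansion instead. The sketch below is an alternative to the IFS-based techniques, not part of them; nth_by_string is a hypothetical helper name, and it does loop internally over the first n delimiters.

```bash
#!/usr/bin/env bash
# nth_by_string STRING SEP N: print field N (0-based) of STRING, where SEP
# may be a string of any length. Hypothetical helper, not a bash builtin.
nth_by_string() {
    local str=$1 sep=$2 n=$3 i
    for ((i = 0; i < n; i++)); do
        [[ $str == *"$sep"* ]] || return 1   # fewer than N+1 fields
        str=${str#*"$sep"}                   # drop the first field and its separator
    done
    printf '%s\n' "${str%%"$sep"*}"          # keep what precedes the next separator
}

nth_by_string 'number 1::number 2::number 3' '::' 1
```

Quoting "$sep" inside the expansions makes the separator match literally, so characters such as * or ? in it are not treated as glob patterns.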
The read -r -a ... <<<... technique generally works well, as long as:
the input is single-line
you're not concerned about a trailing empty field getting discarded
If you need a fully generic, robust solution that addresses the issues above,
use the following variation, which is explained in @gniourf_gniourf's answer here:
sep=';'
IFS="$sep" read -r -d '' -a fields < <(printf "%s${sep}\0" "$x")
Note the need to use -d '' to read multi-line input all at once, and the need to terminate the input with another separator instance to preserve a trailing empty field; the trailing \0 is needed to ensure that read's exit code is 0.
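The effect on a trailing empty field can be checked by counting the array elements each form produces. A minimal sketch; the input string and the array names plain and robust are made up for illustration:

```bash
#!/usr/bin/env bash
x='a;b;'     # the third field is empty
sep=';'

# Simple form: the trailing empty field is dropped.
IFS="$sep" read -r -a plain <<< "$x"

# Robust form: the extra separator keeps the trailing empty field.
IFS="$sep" read -r -d '' -a robust < <(printf "%s${sep}\0" "$x")

echo "plain:  ${#plain[@]} fields"
echo "robust: ${#robust[@]} fields"
```

The simple form should report 2 fields and the robust form 3.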
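Don't use the following approach. It creates an array with a delimiter of ; by temporarily changing IFS and expanding $x unquoted, which also exposes the value to pathname expansion (globbing):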
x="number 1;number 2;number 3"
_IFS=$IFS; IFS=';'
arr=($x)
IFS=$_IFS
echo ${arr[0]} # number 1
echo ${arr[1]} # number 2
echo ${arr[2]} # number 3

Resources