How to extract last part of string in bash?

I have this variable:
A="Some variable has value abc.123"
I need to extract the last value, i.e. abc.123. Is this possible in bash?

Simplest is
echo "$A" | awk '{print $NF}'
Edit: explanation of how this works...
awk breaks the input into different fields, using whitespace as the separator by default. Hardcoding 5 in place of NF prints out the 5th field in the input:
echo "$A" | awk '{print $5}'
NF is a built-in awk variable that gives the total number of fields in the current record. The following returns the number 5 because there are 5 fields in the string "Some variable has value abc.123":
echo "$A" | awk '{print NF}'
Combining $ with NF outputs the last field in the string, no matter how many fields your string contains.
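To see the three awk expressions side by side, here is a minimal sketch. The helper name last_field is hypothetical and not part of the original answer.

```shell
# Hypothetical helper: print the last whitespace-separated field of "$1".
last_field() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

A="Some variable has value abc.123"
field_count=$(printf '%s\n' "$A" | awk '{ print NF }')   # number of fields
fifth=$(printf '%s\n' "$A" | awk '{ print $5 }')         # hardcoded position
last=$(last_field "$A")                                  # position taken from NF

echo "fields=$field_count fifth=$fifth last=$last"
```

Because NF is recomputed for every record, last_field keeps working when the string has more or fewer than five words.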

Yes; this:
A="Some variable has value abc.123"
echo "${A##* }"
will print this:
abc.123
(The ${parameter##word} notation is explained in §3.5.3 "Shell Parameter Expansion" of the Bash Reference Manual.)

Some examples using parameter expansion
A="Some variable has value abc.123"
echo "${A##* }"
abc.123
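Remove the shortest trailing match of " *" (everything from the last space onward):
echo "${A% *}"
Some variable has value
Remove the shortest trailing match of ".*" (everything from the last dot onward):
echo "${A%.*}"
Some variable has value abc
Remove the longest trailing match of " *" (everything from the first space onward):
echo "${A%% *}"
Some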
Read more: "Shell Parameter Expansion" in the Bash Reference Manual.

The documentation is a bit painful to read, so I've summarised it in a simpler way.
Note that the '*' needs to swap places with the ' ' depending on whether you use # or %. (The * is just a wildcard, so you may need to take off your "regex hat" while reading.)
${A% *} - remove the shortest trailing match of " *" (strip the last word)
${A%% *} - remove the longest trailing match of " *" (strip everything after the first word)
${A#* } - remove the shortest leading match of "* " (strip the first word)
${A##* } - remove the longest leading match of "* " (strip everything before the last word)
Of course a "word" here may contain any character that isn't a literal space.
You might commonly use this syntax to trim filenames:
${A##*/} removes all containing folders, if any, from the start of the path, e.g.
/usr/bin/git -> git
/usr/bin/ -> (empty string)
${A%/*} removes the last file/folder/trailing slash, if any, from the end:
/usr/bin/git -> /usr/bin
/usr/bin/ -> /usr/bin
${A%.*} removes the last extension, if any (just be wary of things like my.path/noext):
archive.tar.gz -> archive.tar
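As a rough illustration of the filename idioms above, the following sketch splits one path into its parts. The example path and the variable names are made up for the demonstration.

```shell
path="/usr/local/src/archive.tar.gz"   # hypothetical example path

dir=${path%/*}        # drop the shortest trailing "/*"  -> directory part
base=${path##*/}      # drop the longest leading "*/"    -> file name
stem=${base%.*}       # drop the last extension only
ext=${base##*.}       # keep only the last extension
name=${base%%.*}      # drop every extension

echo "dir=$dir base=$base stem=$stem ext=$ext name=$name"
```

None of these expansions start a process, so they are cheap to use inside loops.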

How do you know where the value begins? If it's always the 5th and 6th words, you could use e.g.:
B=$(echo "$A" | cut -d ' ' -f 5-)
This uses the cut command to slice out part of the line, using a simple space as the word delimiter.

As pointed out by Zedfoxus, this is a very clean method that works on most Unix-like systems (it only needs rev and cut). Besides, you don't need to know the exact position of the substring.
A="Some variable has value abc.123"
echo "$A" | rev | cut -d ' ' -f 1 | rev
# abc.123

More ways to do this:
(Run each of these commands in your terminal to test this live.)
For all answers below, start by typing this in your terminal:
A="Some variable has value abc.123"
The array example (#3 below) is a really useful pattern, and depending on what you are trying to do, sometimes the best.
1. with awk, as the main answer shows
echo "$A" | awk '{print $NF}'
2. with grep:
echo "$A" | grep -o '[^ ]*$'
the -o says to only retain the matching portion of the string
the [^ ] part says "don't match spaces"; ie: "not the space char"
the * means "match 0 or more instances of the preceding pattern" (which is [^ ]), and the $ means "match the end of the line". So, this matches the last word after the last space through to the end of the line; i.e., abc.123 in this case.
3. via regular bash "indexed" arrays and array indexing
Convert A to an array, with elements being separated by the default IFS (Internal Field Separator) char, which is space:
Option 1 (will "break in mysterious ways", as @tripleee put it in a comment here, if the string stored in the A variable contains certain special shell characters, so Option 2 below is recommended instead!):
# Capture space-separated words as separate elements in array A_array
A_array=($A)
Option 2 [RECOMMENDED!]. Use the read command, as I explain in my answer here, and as is recommended by the bash shellcheck static code analyzer tool for shell scripts, in ShellCheck rule SC2206, here.
# Capture space-separated words as separate elements in array A_array, using
# a "herestring".
# See my answer here: https://stackoverflow.com/a/71575442/4561887
IFS=" " read -r -a A_array <<< "$A"
Then, print only the last element in the array:
# Print only the last element via bash array right-hand-side indexing syntax
echo "${A_array[-1]}" # last element only
Output:
abc.123
Going further:
What makes this pattern so useful too is that it allows you to easily do the opposite too!: obtain all words except the last one, like this:
array_len="${#A_array[@]}"
array_len_minus_one=$((array_len - 1))
echo "${A_array[@]:0:$array_len_minus_one}"
Output:
Some variable has value
For more on the ${array[@]:start:length} array slicing syntax above, see my answer here: Unix & Linux: Bash: slice of positional parameters, and for more info. on the bash "Arithmetic Expansion" syntax, see here:
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Arithmetic-Expansion
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Shell-Arithmetic
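Putting the array pattern together, here is a minimal sketch, assuming bash and a single-line input string. The variable names words, count, last and all_but_last are illustrative only. It indexes with count-1 rather than -1, which should also work on bash versions that lack negative subscripts.

```shell
#!/usr/bin/env bash
A="Some variable has value abc.123"

# Split on spaces into an indexed array (read strips the here-string's newline).
IFS=" " read -r -a words <<< "$A"

count=${#words[@]}
last=${words[count-1]}                 # last element, without using [-1]
all_but_last="${words[*]:0:count-1}"   # slice, joined with spaces

echo "count=$count"
echo "last=$last"
echo "all_but_last=$all_but_last"
```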

You can use a Bash regex:
A="Some variable has value abc.123"
[[ $A =~ [[:blank:]]([^[:blank:]]+)$ ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
Prints:
abc.123
That works with any [:blank:] delimiter in the current locale (usually space and tab). If you want to be more specific:
A="Some variable has value abc.123"
pat='[ ]([^ ]+)$'
[[ $A =~ $pat ]] && echo "${BASH_REMATCH[1]}" || echo "no match"

echo "Some variable has value abc.123"| perl -nE'say $1 if /(\S+)$/'

Related

Expect: How to split a number with no delimiter? [duplicate]

Given a filename in the form someletters_12345_moreleters.ext, I want to extract the 5 digits and put them into a variable.
So to emphasize the point, I have a filename with x number of characters then a five digit sequence surrounded by a single underscore on either side then another set of x number of characters. I want to take the 5 digit number and put that into a variable.
I am very interested in the number of different ways that this can be accomplished.

You can use Parameter Expansion to do this.
If x (the number of leading characters) is constant, the following parameter expansion performs substring extraction:
b=${a:12:5}
where 12 is the offset (zero-based) and 5 is the length.
If the underscores around the digits are the only ones in the input, you can strip off the prefix and suffix (respectively) in two steps:
tmp=${a#*_} # remove prefix ending in "_"
b=${tmp%_*} # remove suffix starting with "_"
If there are other underscores, it's probably feasible anyway, albeit more tricky. If anyone knows how to perform both expansions in a single expression, I'd like to know too.
Both solutions presented are pure bash, with no process spawning involved, hence very fast.
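A small sketch that runs both approaches on the sample filename and shows they agree. It assumes bash (the offset form is not available in plain POSIX sh); the names by_offset and by_strip are illustrative.

```shell
#!/usr/bin/env bash
a="someletters_12345_moreleters.ext"

# Fixed layout: "someletters_" is 12 characters, so the digits start at offset 12.
by_offset=${a:12:5}

# Variable layout: strip through the first "_", then from the last "_" onward.
tmp=${a#*_}
by_strip=${tmp%_*}

echo "by_offset=$by_offset by_strip=$by_strip"
```

The offset form breaks as soon as the prefix length changes, while the two-step strip only depends on the underscores.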

Use cut:
echo 'someletters_12345_moreleters.ext' | cut -d'_' -f 2
More generic:
INPUT='someletters_12345_moreleters.ext'
SUBSTRING=$(echo "$INPUT" | cut -d'_' -f 2)
echo "$SUBSTRING"

Just use cut -c startIndex-stopIndex (character positions are 1-based and inclusive).

Generic solution where the number can be anywhere in the filename, using the first such sequence:
number=$(echo "$filename" | grep -Eo '[[:digit:]]{5}' | head -n1)
Another solution, to extract an exact part of a variable:
number=${filename:offset:length}
If your filename always has the format stuff_digits_..., you can use awk:
number=$(echo "$filename" | awk -F _ '{ print $2 }')
Yet another solution, which removes everything except digits (including any digits elsewhere in the name):
number=$(echo "$filename" | tr -cd '[:digit:]')

Here's how I'd do it:
FN=someletters_12345_moreleters.ext
[[ ${FN} =~ _([[:digit:]]{5})_ ]] && NUM=${BASH_REMATCH[1]}
Explanation:
Bash-specific:
[[ ]] indicates a conditional expression
=~ indicates the condition is a regular expression
&& chains the commands if the prior command was successful
Regular Expressions (RE): _([[:digit:]]{5})_
_ are literals to demarcate/anchor matching boundaries for the string being matched
() create a capture group
[[:digit:]] is a character class that matches any single digit
{5} means exactly five of the prior character, class (as in this example), or group must match
In English, you can think of it behaving like this: the FN string is scanned character by character until we see an _, at which point the capture group is opened and we attempt to match five digits. If that matching is successful to this point, the capture group saves the five digits traversed. If the next character is an _, the condition is successful, the capture group is made available in BASH_REMATCH, and the next NUM= statement can execute. If any part of the matching fails, saved details are discarded and character-by-character processing continues after the _. For example, if FN were _1 _12 _123 _1234 _12345_, there would be four false starts before it found a match.
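The matching behaviour described above can be tried out with a small wrapper. This is only a sketch: the function name extract_five_digits is hypothetical, and the regex is stored in a variable first, which is a common way to avoid quoting surprises inside [[ ]].

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the five digits found between two underscores,
# or return 1 when the string contains no such sequence.
extract_five_digits() {
    local re='_([[:digit:]]{5})_'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

num=$(extract_five_digits "someletters_12345_moreleters.ext")
after_false_starts=$(extract_five_digits "_1 _12 _123 _1234 _12345_")

echo "num=$num after_false_starts=$after_false_starts"
```

Checking the return status lets a caller tell "no match" apart from an empty capture.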

In case someone wants more rigorous information, you can also search it in man bash like this
$ man bash [press return key]
/substring [press return key]
[press "n" key]
[press "n" key]
[press "n" key]
[press "n" key]
Result:
${parameter:offset}
${parameter:offset:length}
Substring Expansion. Expands to up to length characters of
parameter starting at the character specified by offset. If
length is omitted, expands to the substring of parameter
starting at the character specified by offset. length and offset are
arithmetic expressions (see ARITHMETIC EVALUATION below). If
offset evaluates to a number less than zero, the value is used
as an offset from the end of the value of parameter. Arithmetic
expressions starting with a - must be separated by whitespace
from the preceding : to be distinguished from the Use Default
Values expansion. If length evaluates to a number less than
zero, and parameter is not @ and not an indexed or associative
array, it is interpreted as an offset from the end of the value
of parameter rather than a number of characters, and the
expansion is the characters between the two offsets. If parameter is
@, the result is length positional parameters beginning at
offset. If parameter is an indexed array name subscripted by @ or
*, the result is the length members of the array beginning with
${parameter[offset]}. A negative offset is taken relative to
one greater than the maximum index of the specified array.
Substring expansion applied to an associative array produces
undefined results. Note that a negative offset must be separated
from the colon by at least one space to avoid being confused
with the :- expansion. Substring indexing is zero-based unless
the positional parameters are used, in which case the indexing
starts at 1 by default. If offset is 0, and the positional
parameters are used, $0 is prefixed to the list.
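A few of the cases from the manual excerpt, as a minimal bash sketch. The sample string and variable names are made up; negative lengths are left out because they are not available in older bash releases.

```shell
#!/usr/bin/env bash
s="0123456789"

middle=${s:2:3}       # offset 2, length 3
tail3=${s: -3}        # negative offset: note the space after the colon
tail_part=${s: -3:2}  # two characters starting three from the end
default=${s:-3}       # no space: this is the ":-" default-value expansion

set -- a b c d        # positional parameters, indexed from 1
params="${*:2:2}"     # two parameters starting at $2, joined by a space

echo "$middle $tail3 $tail_part $default $params"
```

The fourth line is the pitfall the manual warns about: without the space, bash returns the whole value of s because s is set and non-empty.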

I'm surprised this pure bash solution didn't come up:
a="someletters_12345_moreleters.ext"
IFS="_"
set -- $a
echo "$2"
# prints 12345
You will probably want to restore IFS to its previous value, or unset IFS, afterwards!
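One way to avoid having to restore IFS by hand is to do the splitting inside a subshell. The following is a sketch under that assumption; the function name second_field is hypothetical.

```shell
# Hypothetical helper: print the second "_"-separated field of "$1".
# The body is wrapped in ( ... ), so it runs in a subshell and the changes
# to IFS, the globbing option, and the positional parameters do not leak.
second_field() (
    IFS=_
    set -f            # avoid pathname expansion of the unquoted $1
    set -- $1
    printf '%s\n' "$2"
)

a="someletters_12345_moreleters.ext"
ifs_before=$IFS
digits=$(second_field "$a")
echo "$digits"
[ "$IFS" = "$ifs_before" ] && echo "IFS unchanged"
```

The price is one extra process per call, which the original pure-bash version avoids.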

Building on jor's answer (which doesn't work for me):
substring=$(expr "$filename" : '.*_\([^_]*\)_.*')

Following the requirements
I have a filename with x number of characters then a five digit
sequence surrounded by a single underscore on either side then another
set of x number of characters. I want to take the 5 digit number and
put that into a variable.
I found some grep ways that may be useful:
$ echo "someletters_12345_moreleters.ext" | grep -Eo "[[:digit:]]+"
12345
or better
$ echo "someletters_12345_moreleters.ext" | grep -Eo "[[:digit:]]{5}"
12345
And then with -Po syntax:
$ echo "someletters_12345_moreleters.ext" | grep -Po '(?<=_)\d+'
12345
Or if you want to make it fit exactly 5 characters:
$ echo "someletters_12345_moreleters.ext" | grep -Po '(?<=_)\d{5}'
12345
Finally, to store the result in a variable, just use the var=$(command) syntax.

If we focus on the concept of:
"A run of (one or several) digits"
We could use several external tools to extract the numbers.
We could quite easily erase all other characters, with either sed or tr:
name='someletters_12345_moreleters.ext'
echo $name | sed 's/[^0-9]*//g' # 12345
echo $name | tr -c -d 0-9 # 12345
But if $name contains several runs of numbers, the above will fail:
If "name=someletters_12345_moreleters_323_end.ext", then:
echo $name | sed 's/[^0-9]*//g' # 12345323
echo $name | tr -c -d 0-9 # 12345323
We need to use regular expressions (regex).
To select only the first run (12345 not 323) in sed and perl:
echo $name | sed 's/[^0-9]*\([0-9]\{1,\}\).*$/\1/'
echo "$name" | perl -ne 'print "$1\n" if /(\d+)/'
But we could as well do it directly in bash(1) :
regex='[^0-9]*([0-9]{1,}).*$'
[[ $name =~ $regex ]] && echo "${BASH_REMATCH[1]}"
This allows us to extract the FIRST run of digits of any length
surrounded by any other text/characters.
Note: regex='[^0-9]*([0-9]{5}).*$' captures exactly 5 digits (the first five digits of the first run that is at least five digits long). :-)
(1): faster than calling an external tool for each short text. Not faster than doing all the processing inside sed or awk for large files.

Without any sub-processes you can:
shopt -s extglob
front=${input%%_+([a-zA-Z]).*}
digits=${front##+([a-zA-Z])_}
A very small variant of this will also work in ksh93.

This approach gives you more control over what you extract from the string. Here is how to extract 12345:
str="someletters_12345_moreleters.ext"
str=${str#*_}
str=${str%_more*}
echo $str
This is also useful if the part you want can contain arbitrary characters, such as letters or special characters like _ or -. For example, if your string looks like this and you want everything after someletters_ and before _moreleters.ext:
str="someletters_123-45-24a&13b-1_moreleters.ext"
With this code you can state exactly what you want.
Explanation:
#* removes the leading part of the string up to and including the first match of the key. Here the key is _
% removes the trailing part of the string starting at the last match of the key. Here the key is '_more*'
Experiment with it yourself and you will find it interesting.

Here's a prefix-suffix solution (similar to the solutions given by JB and Darron) that matches the first block of digits and does not depend on the surrounding underscores:
str='someletters_12345_morele34ters.ext'
s1="${str#"${str%%[[:digit:]]*}"}" # strip off non-digit prefix from str
s2="${s1%%[^[:digit:]]*}" # strip off non-digit suffix from s1
echo "$s2" # 12345

shell cut - print specific range of characters or given part from a string
#method1) using bash
str=2020-08-08T07:40:00.000Z
echo ${str:11:8}
#method2) using cut
str=2020-08-08T07:40:00.000Z
cut -c12-19 <<< $str
#method3) when working with awk
str=2020-08-08T07:40:00.000Z
awk '{time=gensub(/.{11}(.{8}).*/,"\\1","g",$1); print time}' <<< $str

I love sed's capability to deal with regex groups:
> var="someletters_12345_moreletters.ext"
> digits=$( echo "$var" | sed "s/.*_\([0-9]\+\).*/\1/p" -n )
> echo $digits
12345
A slightly more general option would be not to assume that you have an underscore _ marking the start of your digits sequence, hence for instance stripping off all non-numbers you get before your sequence: s/[^0-9]\+\([0-9]\+\).*/\1/p.
> man sed | grep s/regexp/replacement -A 2
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to
refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.
More on this, in case you're not too confident with regexps:
s is for _s_ubstitute
[0-9]\+ matches one or more digits (\+ is a GNU sed extension in basic regular expressions)
\1 refers to group 1 of the regex match (group 0 is the whole match, group 1 is the match within parentheses in this case)
p flag is for _p_rinting
The backslash escapes are needed because sed uses basic regular expressions by default.
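For systems without GNU sed, the same idea can be sketched with the portable \{1,\} interval instead of \+. Note that this variant also requires an underscore after the digits, which is an assumption taken from the question's filename format rather than from the answer above.

```shell
var="someletters_12345_moreletters.ext"

# -n suppresses automatic printing; the p flag prints only lines where the
# substitution succeeded. \{1,\} is the portable BRE spelling of "one or more".
digits=$(printf '%s\n' "$var" | sed -n 's/.*_\([0-9]\{1,\}\)_.*/\1/p')

# A line without _digits_ produces no output at all.
nothing=$(printf '%s\n' "no_digits_here" | sed -n 's/.*_\([0-9]\{1,\}\)_.*/\1/p')

echo "digits=$digits nothing=[$nothing]"
```

Combining -n with the p flag is what keeps non-matching lines from being echoed unchanged into the result.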

Given test.txt is a file containing "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
cut -b19-20 test.txt > test1.txt # This will extract chars 19 & 20 "ST"
while read -r; do
x=$REPLY
done < test1.txt
echo $x
ST

similar to substr('abcdefg', 2-1, 3) in php:
echo 'abcdefg'|tail -c +2|head -c 3

Maybe this will help you get the desired output:
Code :
your_number=$(echo "someletters_12345_moreleters.ext" | grep -E -o '[0-9]{5}')
echo $your_number
Output :
12345

OK, here is a pure parameter substitution that replaces the unwanted parts with an empty string. The caveat is that I have defined someletters and moreletters as lowercase letters only; if they are alphanumeric, this will not work as is. It also requires extended globbing to be enabled (shopt -s extglob).
filename=someletters_12345_moreletters.ext
substring=${filename//@(+([a-z])_|_+([a-z]).*)}
echo $substring
12345

There's also the external expr command (not a bash builtin; the match keyword is a GNU extension):
INPUT="someletters_12345_moreleters.ext"
SUBSTRING=$(expr match "$INPUT" '.*_\([[:digit:]]*\)_.*')
echo "$SUBSTRING"

A bash solution:
IFS="_" read -r x digs x <<<'someletters_12345_moreleters.ext'
This will clobber a variable called x. The var x could be changed to the var _.
input='someletters_12345_moreleters.ext'
IFS="_" read -r _ digs _ <<<"$input"

Lots of outdated solutions to this problem that require pipes and subshells.
Since version 3 of bash (released in 2004), it has a built-in regular expression comparison operator =~.
input="someletters_12345_moreleters.ext"
# match: underscore followed by 1 or more digits followed by underscore
[[ $input =~ _([0-9]+)_ ]]
echo ${BASH_REMATCH[1]}
Output:
12345
Note, if you're not very proficient in writing RegExp's I recommend reading Mastering Regular Expressions.
If you just need to figure out how to get your RegExp to work, and it's not matching the way you think, try the online GUI at RegEx101.com and set your "Flavor" to "PCRE" so you get the POSIX style character classes like [[:digit:]] that bash uses.

Inclusive end, similar to JS and Java implementations. Remove the +1 if you do not want this.
function substring() {
local str="$1" start="${2}" end="${3}"
if [[ "$start" == "" ]]; then start="0"; fi
if [[ "$end" == "" ]]; then end="${#str}"; fi
local length=$(( end - start + 1 ))
echo "${str:${start}:${length}}"
}
Example:
substring 01234 0
01234
substring 012345 0
012345
substring 012345 0 0
0
substring 012345 1 1
1
substring 012345 1 2
12
substring 012345 0 1
01
substring 012345 0 2
012
substring 012345 0 3
0123
substring 012345 0 4
01234
substring 012345 0 5
012345
More example calls:
substring 012345 0
012345
substring 012345 1
12345
substring 012345 2
2345
substring 012345 3
345
substring 012345 4
45
substring 012345 5
5
substring 012345 6
substring 012345 3 5
345
substring 012345 3 4
34
substring 012345 2 4
234
substring 012345 1 3
123

An easy way to use sed replace:
result=$(echo "someletters_12345_moreleters.ext" | sed 's/.*_\(.*\)_.*/\1/g')
echo $result

A little late, but I just ran across this problem and found the following:
host:/tmp$ asd=someletters_12345_moreleters.ext
host:/tmp$ echo `expr $asd : '.*_\(.*\)_'`
12345
host:/tmp$
I used it to get millisecond resolution on an embedded system that does not have %N for date:
set `grep "now at" /proc/timer_list`
nano=$3
fraction=`expr $nano : '.*\(...\)......'`
$debug nano is $nano, fraction is $fraction

Here is a substring.sh file
Usage
`substring.sh $TEXT 2 3` # characters 2-3
`substring.sh $TEXT 2` # characters 2 and after
substring.sh follows this line
#echo "starting substring"
chars=$1
start=$(($2))
end=$3
i=0
o=""
if [[ -z $end ]]; then
end=`echo "$chars " | wc -c`
else
end=$((end))
fi
#echo "length is " $e
a=`echo $chars | sed 's/\(.\)/\1 /g'`
#echo "a is " $a
for c in $a
do
#echo "substring" $i $e $c
if [[ i -lt $start ]]; then
: # DO Nothing
elif [[ i -gt $end ]]; then
break;
else
o="$o$c"
fi
i=$(($i+1))
done
#echo substring returning $o
echo $o

How can I display unique words contained in a Bash string?

I have a string that has duplicate words. I would like to display only the unique words. The string is:
variable="alpha bravo charlie alpha delta echo charlie"
I know several tools that can do this together. This is what I figured out:
echo $variable | tr " " "\n" | sort -u | tr "\n" " "
What is a more effective way to do this?

Use a Bash Substitution Expansion
The following shell parameter expansion will substitute spaces with newlines, and then pass the results into the sort utility to return only the unique words.
$ echo -e "${variable// /\\n}" | sort -u
alpha
bravo
charlie
delta
echo
This has the side effect of sorting your words, because sort -u has to sort its input in order to detect duplicates (uniq likewise only removes adjacent duplicates, so it needs sorted input). If that's not what you want, I also posted a Ruby solution that preserves the original word order.
Rejoining Words
If, as one commenter pointed out, you're trying to reassemble your unique words back into a single line, you can use command substitution to do this. For example:
$ echo $(echo -e "${variable// /\\n}" | sort -u)
alpha bravo charlie delta echo
The lack of quotes around the command substitution is intentional. If you quote it, the newlines will be preserved because Bash won't do word-splitting. Unquoted, the shell will return the results as a single line, however unintuitive that may seem.
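The difference between the quoted and unquoted forms can be seen in a short sketch. It uses printf instead of echo -e to produce one word per line, which is a substitution made for portability and not the answer's exact command; the variable names are illustrative.

```shell
variable="alpha bravo charlie alpha delta echo charlie"

# Unquoted $variable is word-split, so printf emits one word per line.
unique_lines=$(printf '%s\n' $variable | sort -u)

quoted=$(echo "$unique_lines")    # quoted: newlines are preserved
unquoted=$(echo $unique_lines)    # unquoted: word splitting joins with spaces

printf 'quoted:\n%s\n' "$quoted"
printf 'unquoted: %s\n' "$unquoted"
```

Relying on word splitting also subjects the words to pathname expansion, so this shortcut is best kept for input that contains no glob characters.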

You may use xargs:
echo "$variable" | xargs -n 1 | sort -u | xargs

Note: This solution assumes that all unique words should be output in the order they're encountered in the input. By contrast, the OP's own solution attempt outputs a sorted list of unique words.
A simple Awk-only solution (POSIX-compliant) that is efficient by avoiding a pipeline (which invariably involves subshells).
awk -v RS=' ' '{ if (!seen[$1]++) { printf "%s%s",sep,$1; sep=" " } }' <<<"$variable"
# The above prints without a trailing \n, as in the OP's own solution.
# To add a trailing newline, append `END { print }` to the end
# of the Awk script.
Note how $variable is double-quoted to prevent it from accidental shell expansions, notably pathname expansion (globbing), and how it is provided to Awk via a here-string (<<<).
-v RS=' ' tells Awk to split the input into records by a single space.
Note that the last word will have the input line's trailing newline included, which is why we don't use $0 - the entire record - but $1, the record's first field, which has the newline stripped due to Awk's default field-splitting behavior.
seen[$1]++ is a common Awk idiom that either creates an entry for $1, the input word, in associative array seen, if it doesn't exist yet, or increments its occurrence count.
!seen[$1]++ therefore only returns true for the first occurrence of a given word (where seen[$1] is implicitly zero/the empty string; the ++ is a post-increment, and therefore doesn't take effect until after the condition is evaluated)
{printf "%s%s",sep,$1; sep=" "} prints the word at hand $1, preceded by separator sep, which is implicitly the empty string for the first word, but a single space for subsequent words, due to setting sep to " " immediately after.
Here's a more flexible variant that handles any run of whitespace between input words; it works with GNU Awk and Mawk[1]:
awk -v RS='[[:space:]]+' '{if (!seen[$0]++){printf "%s%s",sep,$0; sep=" "}}' <<<"$variable"
-v RS='[[:space:]]+' tells Awk to split the input into records by any mix of spaces, tabs, and newlines.
[1] Unfortunately, BSD/OSX Awk (in strict compliance with the POSIX spec) doesn't support using regular expressions or even multi-character literals as RS, the input record separator.
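If a regex RS is not available, the same seen[] idiom can be applied to fields instead of records. The sketch below takes that route as an assumption about what is most portable; the function name unique_words is hypothetical.

```shell
# Hypothetical helper: print the unique words of "$1" on one line,
# in first-seen order, separated by single spaces.
unique_words() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++)
            if (!seen[$i]++) {          # true only the first time a word appears
                printf "%s%s", sep, $i
                sep = " "               # every later word gets a leading space
            }
    } END { print "" }'
}

variable="alpha bravo charlie alpha delta echo charlie"
result=$(unique_words "$variable")
echo "$result"
```

Default field splitting treats any run of spaces, tabs, or newlines as a separator, so irregular spacing in the input does not create empty words.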

Preserve Input Order with a Ruby One-Liner
I posted a Bash-specific answer already, but if you want to return only unique words while preserving the word order of the original string, then you can use the following Ruby one-liner:
$ echo "$variable" | ruby -ne 'puts $_.split.uniq'
alpha
bravo
charlie
delta
echo
This will split the input string on whitespace, and then return unique elements from the resulting array.
Unlike the sort or uniq utilities, Ruby doesn't need the words to be sorted to detect duplicates. This may be a better solution if you don't want your results to be sorted, although given your input sample it makes no practical difference for the posted example.
Rejoining Words
If, as one commenter pointed out, you're then trying to reassemble the words back into a single line after deduplication, you can do that too. For that, we just append the Array#join method:
$ echo "$variable" | ruby -ne 'puts $_.split.uniq.join(" ")'
alpha bravo charlie delta echo

You can use awk:
$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) printf "%s ", $i}}'
alpha bravo charlie delta echo
If you do not want the trailing space and do want a trailing newline, you can do:
$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) j = (j=="" ? $i : j" "$i)}} END{print j}'
alpha bravo charlie delta echo

Using associative arrays in BASH 4+ you can simplify this:
variable="alpha bravo charlie alpha delta echo charlie"
# declare an associative array
declare -A unq
# read sentence into an indexed array
read -ra arr <<< "$variable"
# iterate each word and populate associative array with word as key
for w in "${arr[@]}"; do
unq["$w"]=1
done
# print unique results
printf "%s\n" "${!unq[@]}"
delta
bravo
echo
alpha
charlie
## if you want results in same order as original string
for w in "${arr[@]}"; do
[[ ${unq["$w"]} ]] && echo "$w" && unset unq["$w"]
done
alpha
bravo
charlie
delta
echo

pure, ugly bash:
for x in $variable; do
if [ "$(eval echo $(echo \$un__$x))" = "" ]; then
echo -n "$x "
eval un__$x=1
__usv="$__usv un__$x"
fi
done
unset $__usv

Bash regexp to find part of string

I have a string like setSuperValue('sdfsdfd') and I need to get the 'sdfsdfd' value from this line. What is the way to do this?
First I find line by setSuperValue and then get only string with my target content - setSuperValue('sdfsdfd'). How do I build a regexp to get sdfsdfd from this line?

This should help you:
grep setSuperValue myfile.txt | grep -o "'.*'" | tr -d "'"
The grep -o will return only the text that starts with a single ' and ends with another ', including both quotes. Then tr gets rid of the quotes.
You could also use cut:
grep setSuperValue myfile.txt | cut -d"'" -f2
Or awk:
grep setSuperValue myfile.txt | awk -F "'" '{print $2}'
This will split the line where the single quotes are and return the second value, that is what you are looking for.

Generally, to locate a string in multiple lines of data, external utilities will be much faster than looping over lines in Bash.
In your specific case, a single sed command will do what you want:
sed -n -r "s/^.*setSuperValue\('([^']+)'\).*$/\1/p" file
Extended (-r) regular expression ^.*setSuperValue\('([^']+)'\).*$ matches any line containing setSuperValue('...') as a whole, captures whatever ... is in capture group \1, replaces the input line with that, and prints p the result.
Due to option -n, nothing else is printed.
Move the opening and closing ' inside (...) to include them in the captured value.
Note: If the input file contains multiple setSuperValue('...') lines, the command will print every match; either way, the command will process all lines.
To only print the 1st match and stop processing immediately after, modify the command as follows:
sed -n -r "/^.*setSuperValue\('([^']+)'\).*$/ {s//\1/;p;q}" file
/.../ only matches lines containing setSuperValue('...'), causing the following {...} to be executed only for matching lines.
s// - i.e., not specifying a regex - implicitly performs substitution based on the same regex that matched the line at hand; p prints the result, and q quits processing altogether, meaning that processing stops once the first match was found.
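To compare the "print every match" and "stop at the first match" variants, here is a small sketch that feeds both commands the same made-up input. It uses -E instead of -r, on the assumption that -E is accepted by both GNU and BSD sed, and adds a semicolon before the closing brace for the same reason.

```shell
# Hypothetical three-line input with two matching lines.
input="x = 1
setSuperValue('first')
setSuperValue('second')"

# Prints the captured value of every matching line.
all_matches=$(printf '%s\n' "$input" |
    sed -n -E "s/^.*setSuperValue\('([^']+)'\).*$/\1/p")

# Prints the first captured value, then quits without reading further.
first_match=$(printf '%s\n' "$input" |
    sed -n -E "/^.*setSuperValue\('([^']+)'\).*$/ {s//\1/;p;q;}")

printf 'all:\n%s\n' "$all_matches"
printf 'first: %s\n' "$first_match"
```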
If you have already located a line of interest through other methods and are looking for a pure Bash method of extracting a substring based on a regex, use =~, Bash's regex-matching operator, which supports extended regular expressions and capture groups through the special ${BASH_REMATCH[@]} array variable:
$ sampleLine="... setSuperValue('sdfsdfd') ..."
$ [[ $sampleLine =~ "setSuperValue('"([^\']+)"')" ]] && echo "${BASH_REMATCH[1]}"
sdfsdfd
Note the careful quoting of the parts of the regex that should be taken literally, and how ${BASH_REMATCH[1]} refers to the first (and only) captured group.

You can parse the value from each line with expr's anchored regex match (note that expr is an external utility, so this is not pure parameter expansion):
#!/bin/bash
while read -r line; do
value=$(expr "$line" : ".*setSuperValue('\(.*\)')")
if [ "x$value" != "x" ]; then
printf "value : %s\n" "$value"
fi
done <"$1"
Test Input
$ cat dat/supervalue.txt
setSuperValue('sdfsdfd')
something else
setSuperValue('sdfsdfd')
something else
setSuperValue('sdfsdfd')
something else
Example Output
$ bash parsevalue.sh dat/supervalue.txt
value : sdfsdfd
value : sdfsdfd
value : sdfsdfd

Unix - how to use cut -d on one word

I have a string with two words, but sometimes it may contain only one word. I need to get both words, and if the second one is missing I want an empty string.
I am using the following:
STRING1=`echo $STRING|cut -d' ' -f1`
STRING2=`echo $STRING|cut -d' ' -f2`
When STRING is only one word, both strings are equal, but I need the second string to be empty.

Your problem is (from cut(1))
`-f FIELD-LIST'
`--fields=FIELD-LIST'
Select for printing only the fields listed in FIELD-LIST. Fields
are separated by a TAB character by default. Also print any line
that contains no delimiter character, unless the
`--only-delimited' (`-s') option is specified.
You could specify -s when extracting the second word, or use
echo " $STRING" | cut -d' ' -f3
to extract the second word (note the fake separator in front of $STRING).
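Both suggestions can be checked with a short sketch. The helper name second_word and the sample strings are made up for illustration.

```shell
# Hypothetical helper: print the second space-separated word of "$1".
# With -s, cut prints nothing for a line that contains no delimiter.
second_word() {
    printf '%s\n' "$1" | cut -s -d' ' -f2
}

two=$(second_word "word1 word2")
one=$(second_word "word1")

# The "fake separator" variant: prefix a space and ask for field 3.
padded_two=$(printf '%s\n' " word1 word2" | cut -d' ' -f3)
padded_one=$(printf '%s\n' " word1" | cut -d' ' -f3)

echo "two=[$two] one=[$one] padded_two=[$padded_two] padded_one=[$padded_one]"
```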

The shell has built-in functionality for this.
echo "First word: ${STRING%% *}"
echo "Last word: ${STRING##* }"
The double ## or %% is not compatible with older shells; they only had a single-separator variant, which trims the shortest possible match instead of the longest. (You can simulate longest suffix by extracting the shortest prefix, then trim everything else, but this takes two trims.)
Mnemonic: # is to the left of $ on the keyboard, % is to the right.
For your actual problem, I would add a simple check to see if the first extraction extracted the whole string; if so, the second should be left empty.
STRING1="${STRING%% *}"
case $STRING1 in
"$STRING" ) STRING2="" ;;
* ) STRING2="${STRING#"$STRING1" }" ;;
esac
As an aside, there's also this:
set $STRING
STRING1=$1
STRING2=$2

Why not just use read:
STR='word1 word2'
read string1 string2 <<< "$STR"
echo "$string1"
word1
echo "$string2"
word2
Now the missing 2nd word:
STR='word1'
read string1 string2 <<< "$STR"
echo "$string1"
word1
echo "$string2" | cat -vte
$

Extract substring in Bash

Given a filename in the form someletters_12345_moreleters.ext, I want to extract the 5 digits and put them into a variable.
So to emphasize the point, I have a filename with x number of characters then a five digit sequence surrounded by a single underscore on either side then another set of x number of characters. I want to take the 5 digit number and put that into a variable.
I am very interested in the number of different ways that this can be accomplished.

You can use Parameter Expansion to do this.
If a is constant, the following parameter expansion performs substring extraction:
b=${a:12:5}
where 12 is the offset (zero-based) and 5 is the length
If the underscores around the digits are the only ones in the input, you can strip off the prefix and suffix (respectively) in two steps:
tmp=${a#*_} # remove prefix ending in "_"
b=${tmp%_*} # remove suffix starting with "_"
If there are other underscores, it's probably feasible anyway, albeit more tricky. If anyone knows how to perform both expansions in a single expression, I'd like to know too.
Both solutions presented are pure bash, with no process spawning involved, hence very fast.

Use cut:
echo 'someletters_12345_moreleters.ext' | cut -d'_' -f 2
More generic:
INPUT='someletters_12345_moreleters.ext'
SUBSTRING=$(echo $INPUT| cut -d'_' -f 2)
echo $SUBSTRING

just try to use cut -c startIndx-stopIndx

Generic solution where the number can be anywhere in the filename, using the first of such sequences:
number=$(echo "$filename" | grep -Eo '[[:digit:]]{5}' | head -n1)
Another solution to extract exactly a part of a variable:
number=${filename:offset:length}
If your filename always has the format stuff_digits_... you can use awk:
number=$(echo "$filename" | awk -F _ '{ print $2 }')
Yet another solution, which removes everything except digits (so several digit runs end up joined together):
number=$(echo "$filename" | tr -cd '[:digit:]')

Here's how I'd do it:
FN=someletters_12345_moreleters.ext
[[ ${FN} =~ _([[:digit:]]{5})_ ]] && NUM=${BASH_REMATCH[1]}
Explanation:
Bash-specific:
[[ ]] indicates a conditional expression
=~ indicates the condition is a regular expression
&& chains the commands if the prior command was successful
Regular Expressions (RE): _([[:digit:]]{5})_
_ are literals to demarcate/anchor matching boundaries for the string being matched
() create a capture group
[[:digit:]] is a character class, i think it speaks for itself
{5} means exactly five of the prior character, class (as in this example), or group must match
In English, you can think of it behaving like this: the FN string is scanned character by character until an _ is seen, at which point the capture group is opened and we attempt to match five digits. If that succeeds, the capture group saves the five digits traversed. If the next character is an _, the condition is successful, the capture group is made available in BASH_REMATCH, and the following NUM= statement can execute. If any part of the matching fails, the saved details are discarded and character-by-character processing continues after the _. For example, if FN were _1 _12 _123 _1234 _12345_, there would be four false starts before it found a match.
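
To see the false-start behaviour described above, here is a small sketch. The variable names re, NUM2 and FN2 are only for this illustration.

```shell
#!/usr/bin/env bash
re='_([[:digit:]]{5})_'

# The false-start example: only the last group has five digits followed by "_".
FN='_1 _12 _123 _1234 _12345_'
NUM=''
[[ $FN =~ $re ]] && NUM=${BASH_REMATCH[1]}
echo "NUM=$NUM"      # NUM=12345

# Only four digits between the underscores: the test fails and NUM2 keeps its initial value.
FN2='someletters_1234_moreleters.ext'
NUM2='none'
[[ $FN2 =~ $re ]] && NUM2=${BASH_REMATCH[1]}
echo "NUM2=$NUM2"    # NUM2=none
```

Keeping the pattern in a variable and expanding it unquoted on the right of =~ avoids quoting surprises between bash versions.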

In case someone wants more rigorous information, you can also search for it in man bash like this
$ man bash [press return key]
/substring [press return key]
[press "n" key]
[press "n" key]
[press "n" key]
[press "n" key]
Result:
${parameter:offset}
${parameter:offset:length}
Substring Expansion. Expands to up to length characters of
parameter starting at the character specified by offset. If
length is omitted, expands to the substring of parameter start‐
ing at the character specified by offset. length and offset are
arithmetic expressions (see ARITHMETIC EVALUATION below). If
offset evaluates to a number less than zero, the value is used
as an offset from the end of the value of parameter. Arithmetic
expressions starting with a - must be separated by whitespace
from the preceding : to be distinguished from the Use Default
Values expansion. If length evaluates to a number less than
zero, and parameter is not # and not an indexed or associative
array, it is interpreted as an offset from the end of the value
of parameter rather than a number of characters, and the expan‐
sion is the characters between the two offsets. If parameter is
#, the result is length positional parameters beginning at off‐
set. If parameter is an indexed array name subscripted by # or
*, the result is the length members of the array beginning with
${parameter[offset]}. A negative offset is taken relative to
one greater than the maximum index of the specified array. Sub‐
string expansion applied to an associative array produces unde‐
fined results. Note that a negative offset must be separated
from the colon by at least one space to avoid being confused
with the :- expansion. Substring indexing is zero-based unless
the positional parameters are used, in which case the indexing
starts at 1 by default. If offset is 0, and the positional
parameters are used, $0 is prefixed to the list.
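
A few of the cases from that man page excerpt, tried on a sample string. This is only a sketch; the string and the positional parameters are made up.

```shell
#!/usr/bin/env bash
s='abcdefgh'

echo "${s:2}"       # cdefgh  (from offset 2 to the end)
echo "${s:2:3}"     # cde     (3 characters starting at offset 2)
echo "${s: -3}"     # fgh     (negative offset; note the space after the colon)
echo "${s: -3:2}"   # fg
echo "${s:2:-2}"    # cdef    (negative length; needs bash 4.2 or newer)

# With the positional parameters the indexing starts at 1.
set -- one two three four
echo "${@:2:2}"     # two three
```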

I'm surprised this pure bash solution didn't come up:
a="someletters_12345_moreleters.ext"
IFS="_"
set -- $a
echo $2
# prints 12345
You probably want to reset IFS to what value it was before, or unset IFS afterwards!
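
One way to sidestep the reset, sketched under the assumption that a function is acceptable: declare IFS local so the change never leaks to the caller. The name second_field is hypothetical.

```shell
#!/usr/bin/env bash
# second_field is a made-up helper name for this sketch.
second_field() {
    local IFS=_     # scoped to this function; the caller's IFS is untouched
    set -f          # disable globbing so a "*" in the input is not expanded
    set -- $1       # split the first argument on underscores
    set +f          # note: re-enables globbing even if the caller had it off
    printf '%s\n' "$2"
}

second_field 'someletters_12345_moreleters.ext'    # 12345
```

The positional parameters set inside the function are also local to it, so the caller's $1, $2 and so on are not disturbed.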

Building on jor's answer (which doesn't work for me):
substring=$(expr "$filename" : '.*_\([^_]*\)_.*')

Following the requirements
I have a filename with x number of characters then a five digit
sequence surrounded by a single underscore on either side then another
set of x number of characters. I want to take the 5 digit number and
put that into a variable.
I found some grep ways that may be useful:
$ echo "someletters_12345_moreleters.ext" | grep -Eo "[[:digit:]]+"
12345
or better
$ echo "someletters_12345_moreleters.ext" | grep -Eo "[[:digit:]]{5}"
12345
And then with -Po syntax:
$ echo "someletters_12345_moreleters.ext" | grep -Po '(?<=_)\d+'
12345
Or if you want to make it fit exactly 5 characters:
$ echo "someletters_12345_moreleters.ext" | grep -Po '(?<=_)\d{5}'
12345
Finally, to store the result in a variable, just use the var=$(command) syntax.
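
A short sketch of that last step, using the -E variant so it does not depend on grep being built with -P support. The variable names are only illustrative.

```shell
#!/usr/bin/env bash
filename='someletters_12345_moreleters.ext'

# Command substitution captures grep's output; head -n 1 keeps only the first match.
number=$(printf '%s\n' "$filename" | grep -Eo '[[:digit:]]{5}' | head -n 1)

if [ -n "$number" ]; then
    echo "number=$number"    # number=12345
else
    echo "no five-digit run found" >&2
fi
```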

If we focus on the concept of:
"A run of (one or several) digits"
We could use several external tools to extract the numbers.
We could quite easily erase all other characters with either sed or tr:
name='someletters_12345_moreleters.ext'
echo $name | sed 's/[^0-9]*//g' # 12345
echo $name | tr -c -d 0-9 # 12345
But if $name contains several runs of numbers, the above will fail:
If "name=someletters_12345_moreleters_323_end.ext", then:
echo $name | sed 's/[^0-9]*//g' # 12345323
echo $name | tr -c -d 0-9 # 12345323
We need to use regular expressions (regex).
To select only the first run (12345 not 323) in sed and perl:
echo $name | sed 's/[^0-9]*\([0-9]\{1,\}\).*$/\1/'
perl -e 'my ($num) = $ARGV[0] =~ /(\d+)/; print "$num\n";' "$name"
But we could as well do it directly in bash(1) :
regex='[^0-9]*([0-9]{1,}).*$'
[[ $name =~ $regex ]] && echo ${BASH_REMATCH[1]}
This allows us to extract the FIRST run of digits of any length
surrounded by any other text/characters.
Note: regex='[^0-9]*([0-9]{5}).*$' captures exactly five digits (the first five of a run that is at least five digits long). :-)
(1): faster than calling an external tool for each short text. Not faster than doing all the processing inside sed or awk for large files.

Without any sub-processes you can:
shopt -s extglob
front=${input%%_+([a-zA-Z]).*}
digits=${front##+([a-zA-Z])_}
A very small variant of this will also work in ksh93.

My answer gives you more control over what you extract from the string. Here is code that extracts 12345 from your string:
str="someletters_12345_moreleters.ext"
str=${str#*_}
str=${str%_more*}
echo $str
This works just as well if you want to extract something that contains arbitrary characters such as abc, or special characters like _ or -. For example, if your string looks like this and you want everything after someletters_ and before _moreleters.ext:
str="someletters_123-45-24a&13b-1_moreleters.ext"
With my code you can state exactly what you want.
Explanation:
#*_ removes the leading part of the string up to and including the first match of the key. Here the key is _
%_more* removes the trailing part of the string starting at the match of the key. Here the key is _more
Experiment with it yourself and you will find it interesting.
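
Applied to the string with special characters, the same two expansions could look like this. It is a sketch; it uses a separate variable, middle, so the original str is not overwritten.

```shell
#!/usr/bin/env bash
str='someletters_123-45-24a&13b-1_moreleters.ext'

middle=${str#*_}          # drop the shortest prefix ending in "_"
middle=${middle%_more*}   # drop the shortest suffix starting with "_more"

echo "$middle"            # 123-45-24a&13b-1
```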

Here's a prefix-suffix solution (similar to the solutions given by JB and Darron) that matches the first block of digits and does not depend on the surrounding underscores:
str='someletters_12345_morele34ters.ext'
s1="${str#"${str%%[[:digit:]]*}"}" # strip off non-digit prefix from str
s2="${s1%%[^[:digit:]]*}" # strip off non-digit suffix from s1
echo "$s2" # 12345

shell cut - print specific range of characters or given part from a string
#method1) using bash
str=2020-08-08T07:40:00.000Z
echo ${str:11:8}
#method2) using cut
str=2020-08-08T07:40:00.000Z
cut -c12-19 <<< $str
#method3) when working with awk
str=2020-08-08T07:40:00.000Z
awk '{time=gensub(/.{11}(.{8}).*/,"\\1","g",$1); print time}' <<< $str

I love sed's capability to deal with regex groups:
> var="someletters_12345_moreletters.ext"
> digits=$( echo "$var" | sed "s/.*_\([0-9]\+\).*/\1/p" -n )
> echo $digits
12345
A slightly more general option would be not to assume that you have an underscore _ marking the start of your digits sequence, hence for instance stripping off all non-numbers you get before your sequence: s/[^0-9]\+\([0-9]\+\).*/\1/p.
> man sed | grep s/regexp/replacement -A 2
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to
refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.
More on this, in case you're not too confident with regexps:
s is for _s_ubstitute
[0-9]+ matches 1+ digits
\1 links to the group n.1 of the regex output (group 0 is the whole match, group 1 is the match within parentheses in this case)
p flag is for _p_rinting
All escapes \ are there to make sed's regexp processing work.
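
Since \+ is a GNU extension, here is a sketch of the same substitution written with [0-9][0-9]* instead, which should also work with other sed implementations. The second and third inputs are invented to show two edge cases.

```shell
#!/usr/bin/env bash
var='someletters_12345_moreletters.ext'

# [0-9][0-9]* stands in for the GNU-only [0-9]\+ (one or more digits).
digits=$(printf '%s\n' "$var" | sed -n 's/.*_\([0-9][0-9]*\).*/\1/p')
echo "$digits"    # 12345

# The leading .* is greedy, so with several "_digits" groups the last one is captured.
printf '%s\n' 'a_1_b_22_c' | sed -n 's/.*_\([0-9][0-9]*\).*/\1/p'    # 22

# Without a match, -n together with the p flag prints nothing at all.
printf '%s\n' 'no digits here' | sed -n 's/.*_\([0-9][0-9]*\).*/\1/p'
```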

Given test.txt is a file containing "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
cut -b19-20 test.txt > test1.txt # This will extract chars 19 & 20 "ST"
while read -r; do
> x=$REPLY
> done < test1.txt
echo $x
ST

similar to substr('abcdefg', 2-1, 3) in php:
echo 'abcdefg'|tail -c +2|head -c 3
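
For comparison, a sketch that puts the tail/head pipeline next to the pure-bash expansion. Keep in mind that tail -c and head -c count bytes, while ${s:1:3} counts characters in the current locale.

```shell
#!/usr/bin/env bash
s='abcdefg'

# tail -c +2 starts output at byte 2 (1-based); head -c 3 keeps the next three bytes.
piped=$(printf '%s' "$s" | tail -c +2 | head -c 3)

# Pure-bash counterpart: zero-based offset 1, length 3.
expanded=${s:1:3}

echo "$piped $expanded"    # bcd bcd
```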

Maybe this will help you get the desired output
Code :
your_number=$(echo "someletters_12345_moreleters.ext" | grep -E -o '[0-9]{5}')
echo $your_number
Output :
12345

Ok, here goes pure Parameter Substitution with an empty string. The caveat is that I have defined someletters and moreletters as letters only. If they are alphanumeric, this will not work as is.
shopt -s extglob # +(...) and @(...) are extended globs
filename=someletters_12345_moreletters.ext
substring=${filename//@(+([a-z])_|_+([a-z]).*)}
echo $substring
12345

There's also the external expr command (it is not a bash builtin, and the match form shown here is a GNU extension):
INPUT="someletters_12345_moreleters.ext"
SUBSTRING=`expr match "$INPUT" '.*_\([[:digit:]]*\)_.*' `
echo $SUBSTRING

A bash solution:
IFS="_" read -r x digs x <<<'someletters_12345_moreleters.ext'
This will clobber a variable called x. The var x could be changed to the var _.
input='someletters_12345_moreleters.ext'
IFS="_" read -r _ digs _ <<<"$input"

Lots of outdated solutions to this problem that require pipes and subshells.
Since version 3 of bash (released in 2004), it has a built-in regular expression comparison operator =~.
input="someletters_12345_moreleters.ext"
# match: underscore followed by 1 or more digits followed by underscore
[[ $input =~ _([0-9]+)_ ]]
echo ${BASH_REMATCH[1]}
Output:
12345
Note, if you're not very proficient in writing RegExp's I recommend reading Mastering Regular Expressions.
If you just need to figure out how to get your RegExp to work, and it's not matching the way you think, try the online GUI at RegEx101.com and set your "Flavor" to "PCRE" so you get the POSIX style character classes like [[:digit:]] that bash uses.
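
The snippet above echoes BASH_REMATCH whether or not the test succeeded. A slightly more defensive sketch checks the result first; extract_digits is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# extract_digits is a made-up name used only for this sketch.
extract_digits() {
    if [[ $1 =~ _([0-9]+)_ ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1    # no match: print nothing and report failure
    fi
}

extract_digits 'someletters_12345_moreleters.ext'         # 12345
extract_digits 'no-digits-here.ext' || echo 'no match'    # no match
```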

Inclusive end, similar to JS and Java implementations. Remove the +1 if you do not want this.
function substring() {
local str="$1" start="${2}" end="${3}"
if [[ "$start" == "" ]]; then start="0"; fi
if [[ "$end" == "" ]]; then end="${#str}"; fi
local length="((${end}-${start}+1))"
echo "${str:${start}:${length}}"
}
Example:
substring 01234 0
01234
substring 012345 0
012345
substring 012345 0 0
0
substring 012345 1 1
1
substring 012345 1 2
12
substring 012345 0 1
01
substring 012345 0 2
012
substring 012345 0 3
0123
substring 012345 0 4
01234
substring 012345 0 5
012345
More example calls:
substring 012345 0
012345
substring 012345 1
12345
substring 012345 2
2345
substring 012345 3
345
substring 012345 4
45
substring 012345 5
5
substring 012345 6
substring 012345 3 5
345
substring 012345 3 4
34
substring 012345 2 4
234
substring 012345 1 3
123

An easy way to use sed replace:
result=$(echo "someletters_12345_moreleters.ext" | sed 's/.*_\(.*\)_.*/\1/g')
echo $result

A little late, but I just ran across this problem and found the following:
host:/tmp$ asd=someletters_12345_moreleters.ext
host:/tmp$ echo `expr $asd : '.*_\(.*\)_'`
12345
host:/tmp$
I used it to get millisecond resolution on an embedded system that does not have %N for date:
set `grep "now at" /proc/timer_list`
nano=$3
fraction=`expr $nano : '.*\(...\)......'`
$debug nano is $nano, fraction is $fraction

Here is a substring.sh file
Usage
`substring.sh $TEXT 2 3` # characters 2-3
`substring.sh $TEXT 2` # characters 2 and after
substring.sh follows this line
#echo "starting substring"
chars=$1
start=$(($2))
end=$3
i=0
o=""
if [[ -z $end ]]; then
end=`echo "$chars " | wc -c`
else
end=$((end))
fi
#echo "length is " $e
a=`echo $chars | sed 's/\(.\)/\1 /g'`
#echo "a is " $a
for c in $a
do
#echo "substring" $i $e $c
if [[ i -lt $start ]]; then
: # DO Nothing
elif [[ i -gt $end ]]; then
break;
else
o="$o$c"
fi
i=$(($i+1))
done
#echo substring returning $o
echo $o


How to extract last part of string in bash? - string

I have this variable:
A="Some variable has value abc.123"
I need to extract this value i.e abc.123. Is this possible in bash?

Yes; this:
A="Some variable has value abc.123"
echo "${A##* }"
will print this:
abc.123
(The ${parameter##word} notation is explained in §3.5.3 "Shell Parameter Expansion" of the Bash Reference Manual.)

How do you know where the value begins? If it's always the 5th and 6th words, you could use e.g.:
B=$(echo "$A" | cut -d ' ' -f 5-)
This uses the cut command to slice out part of the line, using a simple space as the word delimiter.

As pointed out by Zedfoxus here. A very clean method that works on all Unix-based systems. Besides, you don't need to know the exact position of the substring.
A="Some variable has value abc.123"
echo "$A" | rev | cut -d ' ' -f 1 | rev
# abc.123

echo "Some variable has value abc.123"| perl -nE'say $1 if /(\S+)$/'
