How to extract numbers from a string? - linux

I have string contains a path
string="toto.titi.12.tata.2.abc.def"
I want to extract only the numbers from this string.
To extract the first number:
tmp="${string#toto.titi.*.}"
num1="${tmp%.tata*}"
To extract the second number:
tmp="${string#toto.titi.*.tata.*.}"
num2="${tmp%.abc.def}"
So to extract a parameter I have to do it in 2 steps. How to extract a number with one step?

You can use tr to delete all of the non-digit characters, like so:
echo toto.titi.12.tata.2.abc.def | tr -d -c 0-9

To extract all the individual numbers and print one number word per line pipe through -
tr '\n' ' ' | sed -e 's/[^0-9]/ /g' -e 's/^ *//g' -e 's/ *$//g' | tr -s ' ' | sed 's/ /\n/g'
Breakdown:
Replaces all line breaks with spaces: tr '\n' ' '
Replaces all non numbers with spaces: sed -e 's/[^0-9]/ /g'
Remove leading white space: -e 's/^ *//g'
Remove trailing white space: -e 's/ *$//g'
Squeeze spaces in sequence to 1 space: tr -s ' '
Replace remaining space separators with line break: sed 's/ /\n/g'
Example:
echo -e " this 20 is 2sen\nten324ce 2 sort of" | tr '\n' ' ' | sed -e 's/[^0-9]/ /g' -e 's/^ *//g' -e 's/ *$//g' | tr -s ' ' | sed 's/ /\n/g'
Will print out
20
2
324
2

Here is a short one:
string="toto.titi.12.tata.2.abc.def"
id=$(echo "$string" | grep -o -E '[0-9]+')
echo $id // => output: 12 2
with space between the numbers.
Hope it helps...

Parameter expansion would seem to be the order of the day.
$ string="toto.titi.12.tata.2.abc.def"
$ read num1 num2 <<<${string//[^0-9]/ }
$ echo "$num1 / $num2"
12 / 2
This of course depends on the format of $string. But at least for the example you've provided, it seems to work.
This may be superior to anubhava's awk solution which requires a subshell. I also like chepner's solution, but regular expressions are "heavier" than parameter expansion (though obviously way more precise). (Note that in the expression above, [^0-9] may look like a regex atom, but it is not.)
You can read about this form or Parameter Expansion in the bash man page. Note that ${string//this/that} (as well as the <<<) is a bashism, and is not compatible with traditional Bourne or posix shells.

This would be easier to answer if you provided exactly the output you're looking to get. If you mean you want to get just the digits out of the string, and remove everything else, you can do this:
d#AirBox:~$ string="toto.titi.12.tata.2.abc.def"
d#AirBox:~$ echo "${string//[a-z,.]/}"
122
If you clarify a bit I may be able to help more.

You can also use sed:
echo "toto.titi.12.tata.2.abc.def" | sed 's/[0-9]*//g'
Here, sed replaces
any digits (class [0-9])
repeated any number of times (*)
with nothing (nothing between the second and third /),
and g stands for globally.
Output will be:
toto.titi..tata..abc.def

Convert your string to an array like this:
$ str="toto.titi.12.tata.2.abc.def"
$ arr=( ${str//[!0-9]/ } )
$ echo "${arr[#]}"
12 2

Use regular expression matching:
string="toto.titi.12.tata.2.abc.def"
[[ $string =~ toto\.titi\.([0-9]+)\.tata\.([0-9]+)\. ]]
# BASH_REMATCH[0] would be "toto.titi.12.tata.2.", the entire match
# Successive elements of the array correspond to the parenthesized
# subexpressions, in left-to-right order. (If there are nested parentheses,
# they are numbered in depth-first order.)
first_number=${BASH_REMATCH[1]}
second_number=${BASH_REMATCH[2]}

Using awk:
arr=( $(echo $string | awk -F "." '{print $3, $5}') )
num1=${arr[0]}
num2=${arr[1]}

Hi adding yet another way to do this using 'cut',
echo $string | cut -d'.' -f3,5 | tr '.' ' '
This gives you the following output:
12 2

Fixing newline issue (for mac terminal):
cat temp.txt | tr '\n' ' ' | sed -e 's/[^0-9]/ /g' -e 's/^ *//g' -e 's/ *$//g' | tr -s ' ' | sed $'s/ /\\\n/g'

Assumptions:
there is no embedded white space
the string of text always has 7 period-delimited strings
the string always contains numbers in the 3rd and 5th period-delimited positions
One bash idea that does not require spawning any subprocesses:
$ string="toto.titi.12.tata.2.abc.def"
$ IFS=. read -r x1 x2 num1 x3 num2 rest <<< "${string}"
$ typeset -p num1 num2
declare -- num1="12"
declare -- num2="2"
In a comment OP has stated they wish to extract only one number at a time; the same approach can still be used, eg:
$ string="toto.titi.12.tata.2.abc.def"
$ IFS=. read -r x1 x2 num1 rest <<< "${string}"
$ typeset -p num1
declare -- num1="12"
$ IFS=. read -r x1 x2 x3 x4 num2 rest <<< "${string}"
$ typeset -p num2
declare -- num2="2"
A variation on anubhava's answer that uses parameter expansion instead of a subprocess call to awk, and still working with the same set of initial assumptions:
$ arr=( ${string//./ } )
$ num1=${arr[2]}
$ num2=${arr[4]}
$ typeset -p num1 num2
declare -- num1="12"
declare -- num2="2"

Related

Add suffix to comma-separated strings in bash ecosystem

Is there a way of transforming a comma-delimited variable to add a suffix to each token using standard gnu tools? e.g.
VARIABLE=`aaa,bbb,ccc`
suffix=`-foo`
Expected output = `aaa-foo,bbb-foo,ccc-foo`
Additionally, if I have only one token, the transformation should behave in the same way
e.g. aaa -> aaa-foo
echo "aaa,bbb,ccc" | sed -E 's/([^,]+)/\1-foo/g'
It makes groups of characters that are not "," and then append -foo on it
With variables:
suffix="-foo"; VARIABLE="aaa,bbb,ccc"; echo ${VARIABLE} | sed -E "s/([^,]+)/\1${suffix}/g"
echo $VARIBLE | tr "," "\n" | awk '{print $1"-foo"}' | paste -sd "," -
explanation:
put each token on single line
tr "," "\n"
append "-foo" to each token
awk '{print $1"-foo"}'
join back up with the original comma
paste -sd "," -
Try:
answer = `echo $VARIABLE | sed "s/,/-foo,/g" | sed "s/$/-foo/"`
If you need to have the suffix as a variable then try:
answer = `echo $VARIABLE | sed "s/,/${suffix},/g" | sed "s/$/${suffix}/"`
I don't have access to a Unix box at the moment to prove this works.
The following:
s="aaa,bbb,ccc"
IFS=,
a=( $s )
mapfile -t b < <(printf '%s-foo\n' "${a[#]}")
should give us:
$ declare -p b
declare -a b=([0]="aaa-foo" [1]="bbb-foo" [2]="ccc-foo")
From there, if you can reconstruct the original format in a number of ways...
IFS=, eval 'JOINED="${b[*]}"'
Or if you don't like using eval, perhaps:
d=""; o=""
for x in "${b[#]}"; do
printf -v o '%s%s%s' "$o" "$d" "$x"
d=,
done
... which will put the complete modified string in $o.
With bash Parameter Expansion
var='aaa,bbb,ccc';[ -n "$var" ] && printf "%s\n" "${var//,/-foo,}-foo"

Bash Shell - Return substring after second occurrence of certain character

I need to return everything after a delimeter I decide but still don't fully know how to use sed.
What I need to do is:
$ echo "ABC DE,FG_HI J,123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,," \
| sed <some regexp>
For this example the return should be (substring)everything after the second comma:
123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,,
I can do this with cut like this:
echo "ABC DE,FG_HI J,123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,," | cut -d',' -f 2-
but I've been told cut is slower than sed...
Can some guru who has them (and wants to... :) ) give me a few minutes of his time and advice me please?
Thanks!
Leo
In my experience cut is always faster than sed.
To do what you want with sed you could use a non-matching group:
echo 'ABC DE,FG_HI J,123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,,' |
sed -r 's/([^,]*,){2}//'
This removes the first two fields (if the fields do not contain commas themselves) by removing non-comma characters [^,] followed by a comma twice {2}.
Output:
123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,,
You could also try doing the extraction in bash without spawning an external process at all:
$ [[ 'ABC DE,FG_HI J,123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,,' =~ [^,]*,[^,]*,(.*) ]]
$ echo "${BASH_REMATCH[#]}"
123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,,
or
$ FOO='ABC DE,FG_HI J,123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,,'
$ echo ${FOO/+([^,]),+([^,]),}
or
$ IFS=, read -a FOO <<< 'ABC DE,FG_HI J,123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,,'
$ echo ${FOO[#]:2}
(Assuming this is for a one-off match, not iterating over the contents of a file.)
This method is by find the index of second occurrence of a character and using bash substring to get the required result
input="ABC DE,FG_HI J,123.XYZ-A1,DD/MM/YYYY HH24:MI:SS,,,"
index=$(($(echo $input| grep -aob '/' | grep -oE '[0-9]+' | awk 'NR==2') + 1))
result=${input:$index}

Consolidate sed command

I have a string called $string
string='-d $DESTDIR/ERRORS/$BASEDIR ]] || $MKDIR -p word1 word22 word3.5'
Currently I am piping this through sed twice. Once to pull out special characters and then again to change spaces/tabs to single space.
echo $string | sed 's/[^a-zA-Z0-9]/ /g' | sed 's/\s\s*/ /g'
output='d DESTDIR ERRORS BASEDIR MKDIR p word1 word22 word3 5'
Although this works I am looking to gain some efficiency.
Can someone help me consolidate this to a single sed command?
EDIT
I should have noted that this needs to be POSIX compatible for HP/SOL/LIN
Use tr:
echo $string | tr -Cs a-zA-Z0-9 ' '
tr is a very powerful (and fast) tool for translating, deleting and squeezing characters.
This particular command translates every character from the Complement of the first set (a-zA-Z0-9) into characters from the second set; since the second set contains only a space, this translates all non-alphanumeric characters (including tabs) into spaces. It then squeezes all sequences of characters from the second character set into a single character; this replaces runs of spaces with single spaces.
Example:
$ string='-d $DESTDIR/ERRORS/$BASEDIR ]] || $MKDIR -p word1 word22 word3.5'
$ output=$(echo $string | tr -Cs 'a-zA-Z0-9' ' ')
$ echo $output
d DESTDIR ERRORS BASEDIR MKDIR p word1 word22 word3 5
Try this:
echo $string | sed -r 's/[^[:alnum:]]/ /g;s/ +/ /g'
Guru.

linux shell title case

I am wrinting a shell script and have a variable like this: something-that-is-hyphenated.
I need to use it in various points in the script as:
something-that-is-hyphenated, somethingthatishyphenated, SomethingThatIsHyphenated
I have managed to change it to somethingthatishyphenated by stripping out - using sed "s/-//g".
I am sure there is a simpler way, and also, need to know how to get the camel cased version.
Edit: Working function derived from #Michał's answer
function hyphenToCamel {
tr '-' '\n' | awk '{printf "%s%s", toupper(substr($0,1,1)), substr($0,2)}'
}
CAMEL=$(echo something-that-is-hyphenated | hyphenToCamel)
echo $CAMEL
Edit: Finally, a sed one liner thanks to #glenn
echo a-hyphenated-string | sed -E "s/(^|-)([a-z])/\u\2/g"
a GNU sed one-liner
echo something-that-is-hyphenated |
sed -e 's/-\([a-z]\)/\u\1/g' -e 's/^[a-z]/\u&/'
\u in the replacement string is documented in the sed manual.
Pure bashism:
var0=something-that-is-hyphenated
var1=(${var0//-/ })
var2=${var1[*]^}
var3=${var2// /}
echo $var3
SomethingThatIsHyphenated
Line 1 is trivial.
Line 2 is the bashism for replaceAll or 's/-/ /g', wrapped in parens, to build an array.
Line 3 uses ${foo^}, which means uppercase (while ${foo,} would mean 'lowercase' [note, how ^ points up while , points down]) but to operate on every first letter of a word, we address the whole array with ${foo[*]} (or ${foo[#]}, if you would prefer that).
Line 4 is again a replace-all: blank with nothing.
Line 5 is trivial again.
You can define a function:
hypenToCamel() {
tr '-' '\n' | awk '{printf "%s%s", toupper(substr($0,0,1)), substr($0,2)}'
}
CAMEL=$(echo something-that-is-hyphenated | hypenToCamel)
echo $CAMEL
In the shell you are stuck with being messy:
aa="aaa-aaa-bbb-bbb"
echo " $aa" | sed -e 's/--*/ /g' -e 's/ a/A/g' -e 's/ b/B/g' ... -e 's/ *//g'
Note the carefully placed space in the echo and the double space in the last -e.
I leave it as an exercise to complete the code.
In perl it is a bit easier as a one-line shell command:
perl -e 'print map{ $a = ucfirst; $a =~ s/ +//g; $a} split( /-+/, $ARGV[0] ), "\n"' $aa
For the records, here's a pure Bash safe method (that is not subject to pathname expansion)—using Bash≥4:
var0=something-that-is-hyphenated
IFS=- read -r -d '' -a var1 < <(printf '%s\0' "${var0,,}")
printf '%s' "${var1[#]^}"
This (safely) splits the lowercase expansion of var0 at the hyphens, with each split part in array var1. Then we use the ^ parameter expansion to uppercase the first character of the fields of this array, and concatenate them.
If your variable may also contain spaces and you want to act on them too, change IFS=- into IFS='- '.

How to join multiple lines of filenames into one with custom delimiter

How do I join the result of ls -1 into a single line and delimit it with whatever I want?
paste -s -d joins lines with a delimiter (e.g. ","), and does not leave a trailing delimiter:
ls -1 | paste -sd "," -
EDIT: Simply "ls -m" If you want your delimiter to be a comma
Ah, the power and simplicity !
ls -1 | tr '\n' ','
Change the comma "," to whatever you want. Note that this includes a "trailing comma" (for lists that end with a newline)
This replaces the last comma with a newline:
ls -1 | tr '\n' ',' | sed 's/,$/\n/'
ls -m includes newlines at the screen-width character (80th for example).
Mostly Bash (only ls is external):
saveIFS=$IFS; IFS=$'\n'
files=($(ls -1))
IFS=,
list=${files[*]}
IFS=$saveIFS
Using readarray (aka mapfile) in Bash 4:
readarray -t files < <(ls -1)
saveIFS=$IFS
IFS=,
list=${files[*]}
IFS=$saveIFS
Thanks to gniourf_gniourf for the suggestions.
I think this one is awesome
ls -1 | awk 'ORS=","'
ORS is the "output record separator" so now your lines will be joined with a comma.
Parsing ls in general is not advised, so alternative better way is to use find, for example:
find . -type f -print0 | tr '\0' ','
Or by using find and paste:
find . -type f | paste -d, -s
For general joining multiple lines (not related to file system), check: Concise and portable “join” on the Unix command-line.
The combination of setting IFS and use of "$*" can do what you want. I'm using a subshell so I don't interfere with this shell's $IFS
(set -- *; IFS=,; echo "$*")
To capture the output,
output=$(set -- *; IFS=,; echo "$*")
Adding on top of majkinetor's answer, here is the way of removing trailing delimiter(since I cannot just comment under his answer yet):
ls -1 | awk 'ORS=","' | head -c -1
Just remove as many trailing bytes as your delimiter counts for.
I like this approach because I can use multi character delimiters + other benefits of awk:
ls -1 | awk 'ORS=", "' | head -c -2
EDIT
As Peter has noticed, negative byte count is not supported in native MacOS version of head. This however can be easily fixed.
First, install coreutils. "The GNU Core Utilities are the basic file, shell and text manipulation utilities of the GNU operating system."
brew install coreutils
Commands also provided by MacOS are installed with the prefix "g". For example gls.
Once you have done this you can use ghead which has negative byte count, or better, make alias:
alias head="ghead"
Don't reinvent the wheel.
ls -m
It does exactly that.
just bash
mystring=$(printf "%s|" *)
echo ${mystring%|}
This command is for the PERL fans :
ls -1 | perl -l40pe0
Here 40 is the octal ascii code for space.
-p will process line by line and print
-l will take care of replacing the trailing \n with the ascii character we provide.
-e is to inform PERL we are doing command line execution.
0 means that there is actually no command to execute.
perl -e0 is same as perl -e ' '
To avoid potential newline confusion for tr we could add the -b flag to ls:
ls -1b | tr '\n' ';'
It looks like the answers already exist.
If you want
a, b, c format, use ls -m ( Tulains Córdova’s answer)
Or if you want a b c format, use ls | xargs (simpified version of Chris J’s answer)
Or if you want any other delimiter like |, use ls | paste -sd'|' (application of Artem’s answer)
The sed way,
sed -e ':a; N; $!ba; s/\n/,/g'
# :a # label called 'a'
# N # append next line into Pattern Space (see info sed)
# $!ba # if it's the last line ($) do not (!) jump to (b) label :a (a) - break loop
# s/\n/,/g # any substitution you want
Note:
This is linear in complexity, substituting only once after all lines are appended into sed's Pattern Space.
#AnandRajaseka's answer, and some other similar answers, such as here, are O(n²), because sed has to do substitute every time a new line is appended into the Pattern Space.
To compare,
seq 1 100000 | sed ':a; N; $!ba; s/\n/,/g' | head -c 80
# linear, in less than 0.1s
seq 1 100000 | sed ':a; /$/N; s/\n/,/; ta' | head -c 80
# quadratic, hung
sed -e :a -e '/$/N; s/\n/\\n/; ta' [filename]
Explanation:
-e - denotes a command to be executed
:a - is a label
/$/N - defines the scope of the match for the current and the (N)ext line
s/\n/\\n/; - replaces all EOL with \n
ta; - goto label a if the match is successful
Taken from my blog.
If you version of xargs supports the -d flag then this should work
ls | xargs -d, -L 1 echo
-d is the delimiter flag
If you do not have -d, then you can try the following
ls | xargs -I {} echo {}, | xargs echo
The first xargs allows you to specify your delimiter which is a comma in this example.
ls produces one column output when connected to a pipe, so the -1 is redundant.
Here's another perl answer using the builtin join function which doesn't leave a trailing delimiter:
ls | perl -F'\n' -0777 -anE 'say join ",", #F'
The obscure -0777 makes perl read all the input before running the program.
sed alternative that doesn't leave a trailing delimiter
ls | sed '$!s/$/,/' | tr -d '\n'
Python answer above is interesting, but the own language can even make the output nice:
ls -1 | python -c "import sys; print(sys.stdin.read().splitlines())"
You can use:
ls -1 | perl -pe 's/\n$/some_delimiter/'
If Python3 is your cup of tea, you can do this (but please explain why you would?):
ls -1 | python -c "import sys; print(','.join(sys.stdin.read().splitlines()))"
ls has the option -m to delimit the output with ", " a comma and a space.
ls -m | tr -d ' ' | tr ',' ';'
piping this result to tr to remove either the space or the comma will allow you to pipe the result again to tr to replace the delimiter.
in my example i replace the delimiter , with the delimiter ;
replace ; with whatever one character delimiter you prefer since tr only accounts for the first character in the strings you pass in as arguments.
You can use chomp to merge multiple line in single line:
perl -e 'while (<>) { if (/\$/ ) { chomp; } print ;}' bad0 >test
put line break condition in if statement.It can be special character or any delimiter.
Quick Perl version with trailing slash handling:
ls -1 | perl -E 'say join ", ", map {chomp; $_} <>'
To explain:
perl -E: execute Perl with features supports (say, ...)
say: print with a carrier return
join ", ", ARRAY_HERE: join an array with ", "
map {chomp; $_} ROWS: remove from each line the carrier return and return the result
<>: stdin, each line is a ROW, coupling with a map it will create an array of each ROW

Resources