Cutting certain string of variable - string

I'd like to extract a specific substring from a variable.
The variable contains the following, including a lot of blank space before <div... and a class attribute:
<div data-href="/www.somewebspace.com" class="class1 class2">
I would like to extract the contents of the data-href attribute, i.e. get this output: /www.somewebspace.com
I tried the following code, but the output starts with the contents of the data-href attribute and also includes the class attribute.
echo $Test | grep -oP '(?<=<div data-href=").*(?=")'
How can I get rid of the class attribute?
Kind regards and grateful for every reply,
X3nion
P.S. Another question arose. I've got these strings I'd like to extract from a text file:
<div class="aditem-addon">
Today, 23:23</div>
What would be the correct command to extract only "Today, 23:23", without any whitespace before or after the term?
Maybe I would have to delete the blank spaces first?

your regex is correct, you only need to adjust the greediness of the * quantifier:
* is a greedy quantifier : match as much as possible whilst getting a match
*? is a reluctant quantifier : match the minimum characters to get a match
# Correct
Test='<div data-href="/www.somewebspace.com" class="fdgks"></div>'
echo $Test | grep -oP '(?<=<div data-href=").*?(?=")'
#> /www.somewebspace.com
# the desired output
# WRONG
echo $Test | grep -oP '(?<=<div data-href=").*(?=")'
#> /www.somewebspace.com" class="fdgks
# didn't stop until it matched the last quote `"`
echo $Test$Test | grep -oP '(?<=<div data-href=").*(?=")'
#> /www.somewebspace.com" class="fdgks"></div><div data-href="/www.somewebspace.com" class="fdgks
# same as the last one
For a more detailed explanation of the difference between greedy, reluctant and possessive quantifiers, see the reference linked in the original answer.
EDIT
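echo "$Test$Test" | grep -Poz '(?<=<div class="aditem-addon">\n ).*?(?=<\/div>)' | tr '\0' '\n'
#> Today, 23:23
#> Today, 23:23
\n followed by a space matches a newline and one leading space. Here Test must hold the two-line <div class="aditem-addon"> snippet and must be quoted so the newline survives; with -z, grep separates matches with NUL bytes, so tr turns them back into newlines.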
if the string you're looking for contains the newline character \n you'll need to add the z option to grep i.e the call will be grep -ozP
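If grep -P is not available, the same extraction and the whitespace trimming asked about in the P.S. can be sketched with plain parameter expansion. This is only a sketch under assumptions: the variable names html and inner are made up for illustration, and the input is assumed to contain exactly one such div.

```shell
# Sample input: tag, newline, indentation, text, closing tag (assumed shape)
html='<div class="aditem-addon">
            Today, 23:23</div>'

# Keep what follows the opening tag, then cut at the closing tag
inner=${html#*'<div class="aditem-addon">'}
inner=${inner%%'</div>'*}

# Trim leading whitespace (newline and spaces), then trailing whitespace
inner=${inner#"${inner%%[![:space:]]*}"}
inner=${inner%"${inner##*[![:space:]]}"}

echo "$inner"
#> Today, 23:23
```

The two trim lines work by first computing the run of whitespace at one end and then removing exactly that run, so spaces inside the text are left alone.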

Unless the input is very simple, consider using xmllint or another HTML parsing tool. For very simple cases, you can use a pure shell solution:
#! /bin/sh
s=' <div data-href="/www.somewebspace.com" class="class1 class2"> '
s1=${s##*data-href=\"}
s1=${s1%%\"*}
echo "$s1"
Which will print
/www.somewebspace.com

Related

extract certain string from variable

I've got a text file containing the html-source of a web page. There are lines with "data-adid="...". These lines I'd like to capture.
Therefore, I use:
Id=$(grep -m 10 -A 1 "data-adid" Textfile)
to get the first ten results.
The variable Id contains the following:
<arcicle class="aditem" data-adid="1234567890" <div class="aditem-image"> --
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image"> --
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image"> --
...
I would like to get the following output:
id="1234567890" id="2134567890" id="3124567890"
When using the grep command, I only manage to get the numbers, e.g.
Id2=$(echo $Id | grep -oP '(?<=data-adid=").*?(?=")')
gets 1234567890 2134567890 3124567890
When trying
Id2=$(echo $Id | grep -oP '(?<=data-ad).*?(?=")')
this will only give me id= id= id=
How could the code be changed to get the desired output?
HTML is best handled by tools that understand HTML, but since the OP asks for a shell-based solution, I would go for awk for this one. Written and tested at https://ideone.com/EpU1aW
echo "$var" |
awk '
match($0,/data-adid="[^"]*"/){
val=substr($0,RSTART,RLENGTH)
sub(/^data-ad/,"",val)
print val
val=""
}
'
Your lookbehind matches only data-ad. Match the id= part too, from the opening " up to the next ". I also see no reason to use fancy lookarounds: just match the string and output only the matched part.
grep -oP 'data-ad\Kid="[^"]*"'
Should be enough. Note that $Id undergoes word splitting expansion and most probably should be quoted and that it's impossible to parse html using regex so you should most probably use html syntax aware tools instead.
With any sed:
$ sed 's/.*data-ad\(id="[^"]*"\).*/\1/' file
id="1234567890"
id="2134567890"
id="2134567890"
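The quoting remark above is easy to check. The following is a small sketch with made-up sample data (two simplified lines modelled on the question) showing what the unquoted $Id does to the line structure, and how the sed expression can be combined with paste to get everything on one line as the question asked. The variable names are illustrative only.

```shell
# Two-line sample, simplified from the question (assumed data)
Id='<article class="aditem" data-adid="1234567890"
<article class="aditem" data-adid="2134567890"'

# Unquoted: word splitting collapses the newline, echo prints one line
unquoted_lines=$(echo $Id | grep -c '')
# Quoted: the newline survives, so there are two lines
quoted_lines=$(echo "$Id" | grep -c '')
echo "unquoted: $unquoted_lines, quoted: $quoted_lines"
#> unquoted: 1, quoted: 2

# Extract id="..." per line, then join the lines with single spaces
ids=$(printf '%s\n' "$Id" | sed -n 's/.*data-ad\(id="[^"]*"\).*/\1/p' | paste -sd' ' -)
echo "$ids"
#> id="1234567890" id="2134567890"
```

With sed -n and the p flag, lines without data-adid (such as the -- separators that grep -A prints) are dropped instead of being echoed unchanged.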

Remove path prefix of space separated paths

Given a list of paths separated by a single space:
/home/me/src/test /home/me/src/vendor/a /home/me/src/vendor/b
I want to remove the prefix /home/me/src/ so that the result is:
test vendor/a vendor/b
For a single path I would do: ${PATH#/home/me/src/} but how do I apply it to this series?
You can use // to replace all occurrences of substring. Replace it with null string to remove them.
$ path="/home/me/src/test /home/me/src/vendor/a /home/me/src/vendor/b"
$ echo ${path//\/home\/me\/src\/}
test vendor/a vendor/b
Reference: ${parameter/pattern/string} in Bash reference manual
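For shells without the // form, the single-path # expansion from the question can be applied in a loop instead. A minimal sketch, assuming the paths contain no spaces or glob characters themselves (the names paths, prefix and result are just for illustration):

```shell
paths="/home/me/src/test /home/me/src/vendor/a /home/me/src/vendor/b"
prefix=/home/me/src/

result=
# $paths is left unquoted on purpose so it splits on the spaces
for p in $paths; do
    # Strip the prefix only where it starts the path; join with one space
    result="${result:+$result }${p#"$prefix"}"
done

echo "$result"
#> test vendor/a vendor/b
```

Unlike a global replacement, this only removes the prefix at the beginning of each path, so a path such as /opt/home/me/src/x would be left untouched.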
Using shell parameter expansion is useful, as nu11p01n73R's answer reveals.
For clarity, I would use sed with the syntax sed 's#pattern#replacement#g':
$ str="/home/me/src/test /home/me/src/vendor/a /home/me/src/vendor/b"
$ sed 's#/home/me/src/##g' <<< "$str"
test vendor/a vendor/b
As always, a grep solution from my side:
echo 'your string' | grep -Po '^/([^ /]*/)+\K.+'
Please note that the above regex does this for any string like /x/y/z/test ... But if you are interested only in removing /home/me/src/, try the following:
echo 'your string' | grep -Po '^/home/me/src/\K.+' --color

Extract Digits From String After Capturing It From File

I'm trying to retrieve a memory value from file, and compare it to reference value. But one thing at a time....
I've attempted using set/source/grep/substring to variable, but none of them actually worked. Then I found a way to do it using a for loop (see code).
The issue: I'm receiving the entire string from the file, but I can't manage to get rid of the last character in it.
#!/bin/bash
#source start_params.properties
#mem_val= "$default.default.minmaxmemory.main"
#mem_val= grep "default.default.minmaxmemory.main" start_params.properties
for mLine in $(grep 'default.default.minmaxmemory.main' start_params.properties)
do
echo "$mLine"
done
echo "${mLine:4:5}" # didn't get rid of the last `m` in `-max4096m`
v1="max"
v2="m"
echo "$mLine" | sed -e "s/.*${v1}//;s/${v2}.*//" #this echo the right value.
The loop iterates twice:
First output: default.default.minmaxmemory.main=-min512m
Second output: -max4096m
The sed command then outputs 4096, but how can I change the last line of the code so that it stores the value in a variable?
Thank you for your suggestions,
You could use grep to filter the max part and then another grep -o to extract the numbers:
echo "$mLine" | grep 'max' | grep -o '[[:digit:]]*'
$ sed '/max[0-9]/!d; s/.*max//; s/m//' start_params.properties
4096
remove lines not matching max[0-9]
remove first part of line until max
remove final m
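To actually store the value in a variable, as the question asks, any of these pipelines can be wrapped in command substitution. A sketch, assuming the property line looks like the one shown in the question (the variable names line and mem_val are illustrative, and the line is hard-coded here instead of being read from start_params.properties):

```shell
# Sample line as it would appear in the properties file (assumed content)
line='default.default.minmaxmemory.main=-min512m -max4096m'

# $( ... ) captures the command's output; sed strips everything up to
# and including "max", then everything from the following "m" onwards
mem_val=$(printf '%s\n' "$line" | sed -e 's/.*max//' -e 's/m.*//')

echo "$mem_val"
#> 4096
```

Once the value is in mem_val, it can be compared against a reference with an arithmetic test such as [ "$mem_val" -ge 2048 ].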

Extracting part of a string to a variable in bash

noob here, sorry if a repost. I am extracting a string from a file, and end up with a line, something like:
abcdefg:12345:67890:abcde:12345:abcde
Let's say it's in a variable named testString
The length of the values between the colons is not constant, but I want to save the number between the 2nd and 3rd colons to a variable (as a string is fine). So in this case I'd end up with my new variable, let's call it extractedNum, being 67890. I assume I have to use sed, but I have never used it and I'm still trying to get my head around it...
Can anyone help? Cheers
On a side-note, I am using find to extract the entire line from a string, by searching for the 1st string of characters, in this case the abcdefg part.
Pure Bash using an array:
testString="abcdefg:12345:67890:abcde:12345:abcde"
# Set IFS only for this read, so the shell's global IFS stays untouched
IFS=':' read -r -a array <<< "$testString"
echo "value = ${array[2]}"
The output:
value = 67890
Here's another pure bash way. Works fine when your input is reasonably consistent and you don't need much flexibility in which section you pick out.
extractedNum="${testString#*:}" # Remove through first :
extractedNum="${extractedNum#*:}" # Remove through second :
extractedNum="${extractedNum%%:*}" # Remove from next : to end of string
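To see what each of the three expansions does, here is the same idea traced with the intermediate values kept in separate variables (step1 to step3 are illustrative names, not from the answer):

```shell
testString="abcdefg:12345:67890:abcde:12345:abcde"

step1=${testString#*:}   # shortest prefix ending in ':' removed
echo "$step1"
#> 12345:67890:abcde:12345:abcde

step2=${step1#*:}        # same again: second field removed
echo "$step2"
#> 67890:abcde:12345:abcde

step3=${step2%%:*}       # longest suffix starting with ':' removed
echo "$step3"
#> 67890
```

Note the asymmetry: a single # removes as little as possible from the front, while the double %% removes as much as possible from the back.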
You could also filter the file while reading it, in a while loop for example:
while IFS=' ' read -r col line ; do
# col has the column you wanted, line has the whole line
# # #
done < <(sed -e 's/\([^:]*:\)\{2\}\([^:]*\).*/\2 &/' "yourfile")
The sed command is picking out the 2nd column and delimiting that value from the entire line with a space. If you don't need the entire line, just remove the space+& from the replacement and drop the line variable from the read. You can pick any column by changing the number in the \{2\} bit. (Put the command in double quotes if you want to use a variable there.)
You can use cut for this kind of stuff. Here you go:
VAR=$(echo abcdefg:12345:67890:abcde:12345:abcde |cut -d":" -f3); echo $VAR
For the fun of it, this is how I would (not) do this with sed, but I'm sure there's easier ways. I guess that'd be a question of my own to future readers ;)
echo abcdefg:12345:67890:abcde:12345:abcde |sed -e "s/[^:]*:[^:]*:\([^:]*\):.*/\1/"
this should work for you: the key part is awk -F: '$0=$3'
NewVar=$(getTheLineSomehow...|awk -F: '$0=$3')
example:
kent$ newVar=$(echo "abcdefg:12345:67890:abcde:12345:abcde"|awk -F: '$0=$3')
kent$ echo $newVar
67890
if your text was stored in var testString, you could:
kent$ echo $testString
abcdefg:12345:67890:abcde:12345:abcde
kent$ newVar=$(awk -F: '$0=$3' <<<"$testString")
kent$ echo $newVar
67890

split a string variable through shell script

I have a string containing a date and time, as timestamp= 12-12-2012 16:45:00
I need to reformat it into timestamp= 16:45:00 12-12-2012
How can I achieve this in a shell script?
Please note: the variable's value is 12-12-2012 16:45:00 and timestamp is the name of the variable
#!/usr/bin/expect
set timestamp "16:45:00 12-12-2012"
Now what should i do so value of timestamp will become 12-12-2012 16:45:00
The script extension is .tcl, for example test.tcl
You could use variable patterned removal. ## means "greedily remove everything that matches the pattern, starting from the left". %% means the same from the right:
tm=${timestamp##* }
dt=${timestamp%% *}
result="$tm $dt"
or you could use cut to do the same, giving a field delimiter:
tm=$(echo $timestamp | cut -f2 -d' ')
dt=$(echo $timestamp | cut -f1 -d' ')
result="$tm $dt"
or you could use sed to swap them with a regex (see other post).
or if you are pulling the date from the date command, you could ask it to format it for you:
result=$(date +'%r %F')
and for that matter, you might have a version of date that will parse your date and then let you express it however you want:
result=$(date -d '12/12/2012 4:45 pm' +'%r %F')
Admittedly, this last one is picky about date input... see "info date" for information on accepted inputs.
If you want to use regex, I like Perl's...they are cleaner to write:
echo $timestamp | perl -p -e 's/^(\S+)\s+(\S+)/$2 $1/'
where \S matches non-space characters, + means "one or more", and \s matches spaces. The parens do captures of the parts matched.
EDIT:
Sorry, didn't realize that the "timestamp=" was part of the actual data. All of the above examples work if you first strip that bit out:
var='timestamp=2012-12-12 16:45:11'
timestamp=${var#timestamp=}
... then as above ...
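Putting the stripping step and the pattern removal together might look like the sketch below. It assumes the data has the shape shown in the question, with a single space after "timestamp="; that extra space has to be removed too, otherwise the %% expansion would return an empty date.

```shell
var='timestamp= 12-12-2012 16:45:00'

timestamp=${var#timestamp=}   # drop the label, leaves " 12-12-2012 16:45:00"
timestamp=${timestamp# }      # drop the single leading space

tm=${timestamp##* }           # everything after the last space: the time
dt=${timestamp%% *}           # everything before the first space: the date

result="timestamp= $tm $dt"
echo "$result"
#> timestamp= 16:45:00 12-12-2012
```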
Using sed:
sed 's/\([0-9]*-[0-9]*-[0-9]*\)\([ \t]*\)\(.*\)/\3\2\1/' input
This command works on lines containing the pattern number-number-number whitespace anything. It simply swaps the number-number-number part \([0-9]*-[0-9]*-[0-9]*\) with the anything part \(.*\), also keeping the original whitespace \([ \t]*\). So the replacement part of the sed command is \3\2\1, which means the third part, the whitespace, and the first part.
Same logic with tcl:
set timestamp "12-12-2012 16:45:00"
set s [regsub {([0-9]*-[0-9]*-[0-9]*)([ \t]*)(.*)} $timestamp \\3\\2\\1]
puts $s
awk solution here:
string="timestamp= 12-12-2012 16:45:00"
awk '{print $1, $3, $2}' <<< "$string"
In bash (and similar shells):
$ timestamp="12-12-2012 16:45:00"
$ read -a tsarr <<< "$timestamp"
$ echo "${tsarr[1]} ${tsarr[0]}"
16:45:00 12-12-2012
