extract certain string from variable - string

I've got a text file containing the HTML source of a web page. There are lines containing data-adid="...", and these are the lines I'd like to capture.
Therefore, I use:
Id=$(grep -m 10 -A 1 "data-adid" Textfile)
to get the first ten results.
The variable Id contains the following:
<article class="aditem" data-adid="1234567890" <div class="aditem-image"> --
<article class="aditem" data-adid="2134567890" <div class="aditem-image"> --
<article class="aditem" data-adid="2134567890" <div class="aditem-image"> --
...
I would like to get the following output:
id="1234567890" id="2134567890" id="3124567890"
When using the grep command, I only manage to get the numbers, e.g.
Id2=$(echo $Id | grep -oP '(?<=data-adid=").*?(?=")')
gets 1234567890 2134567890 3124567890
When trying
Id2=$(echo $Id | grep -oP '(?<=data-ad).*?(?=")')
this will only give me id= id= id=
How could the code be changed to get the desired output?

HTML is best handled with tools that understand HTML, but since the OP asks for a shell-based approach, I would go for awk here. Written and tested at https://ideone.com/EpU1aW
echo "$var" |
awk '
match($0,/data-adid="[^"]*"/){
val=substr($0,RSTART,RLENGTH)
sub(/^data-ad/,"",val)
print val
val=""
}
'

(?<=data-ad) matches only data-ad; you need to actually match the id= part too, from the opening " up to the next ". And I see no reason to use fancy lookarounds: just match the string and use \K to output only the wanted part.
grep -oP 'data-ad\Kid="[^"]*"'
That should be enough. Note that $Id undergoes word splitting when expanded unquoted, so it should most probably be quoted as "$Id". Also, HTML cannot be parsed reliably with regular expressions in general, so you should most probably use HTML-aware tools instead.

With any sed:
$ sed 's/.*data-ad\(id="[^"]*"\).*/\1/' file
id="1234567890"
id="2134567890"
id="2134567890"
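The question asked for the ids on a single line. A minimal sketch of one way to get there, assuming the variable holds one match per line: quote the variable so its newlines survive, extract with the same sed expression, and join the lines with paste. The sample value below is hypothetical and only stands in for the real contents of $Id.

```shell
# Hypothetical sample standing in for the contents of $Id.
Id='<article class="aditem" data-adid="1234567890" <div class="aditem-image"> --
<article class="aditem" data-adid="2134567890" <div class="aditem-image"> --
<article class="aditem" data-adid="3124567890" <div class="aditem-image"> --'

# Quote "$Id" to keep the newlines, print only lines where the substitution
# succeeded (-n with the p flag), then join the lines with single spaces.
ids=$(printf '%s\n' "$Id" |
  sed -n 's/.*data-ad\(id="[^"]*"\).*/\1/p' |
  paste -sd' ' -)

echo "$ids"
```

Like the sed answer above, this keeps only the last data-adid on each line, so it assumes at most one such attribute per line.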

Related

Cutting certain string of variable

I'd like to cut off some special strings of a variable.
The variable contains the following, including a lot of blank space before <div... and a class attribute:
<div data-href="/www.somewebspace.com" class="class1 class2">
I would like to extract the contents of the data-href attribute i.e have this output /www.somewebspace.com
I tried out the following code, the output starts with the contents of the data-href attribute and the class attribute.
echo $Test | grep -oP '(?<=<div data-href=").*(?=")'
How can I get rid of the class attribute?
Kind regards and grateful for every reply,
X3nion
P.S. Another question arose. I've got these strings I'd like to extract from a text file:
<div class="aditem-addon">
Today, 23:23</div>
What would be the correct command to extract only "Today, 23:23", without any whitespace before or after it?
Maybe I would have to delete the blank spaces first?
Your regex is almost correct; you only need to adjust the greediness of the * quantifier:
* is a greedy quantifier: it matches as much as possible while still getting a match
*? is a reluctant quantifier: it matches the minimum number of characters needed to get a match
# Correct
Test='<div data-href="/www.somewebspace.com" class="fdgks"></div>'
echo $Test | grep -oP '(?<=<div data-href=").*?(?=")'
#> /www.somewebspace.com
# the desired output
# WRONG
echo $Test | grep -oP '(?<=<div data-href=").*(?=")'
#> /www.somewebspace.com" class="fdgks
# didn't stop until it matched the last quote `"`
echo $Test$Test | grep -oP '(?<=<div data-href=").*(?=")'
#> /www.somewebspace.com" class="fdgks"></div><div data-href="/www.somewebspace.com" class="fdgks
# same as the last one
for a more detailed explanation about the difference between greedy, reluctant and possessive quantifiers (see)
EDIT
echo $Test$Test | grep -Poz '(?<=<div class="aditem-addon">\n ).*?(?=<\/div>)'
#> Today, 23:23
#> Today, 23:23
\n followed by a space matches the newline and one leading space.
If the string you're looking for spans a newline character \n, you'll need to add the -z option to grep, i.e. the call becomes grep -ozP
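For the P.S. part of the question (the timestamp surrounded by whitespace), here is a hedged alternative sketch using sed instead of grep -z. It assumes the text sits on the line immediately after the opening tag and is followed by </div> on that same line; the sample value and the variable names html and stamp are made up for illustration.

```shell
# Hypothetical sample: the text is indented on the line after the opening tag.
html='<div class="aditem-addon">
                Today, 23:23</div>'

# On a line containing the opening tag, read the next line (n), strip leading
# whitespace, strip everything from optional whitespace plus </div> onward,
# and print what is left.
stamp=$(printf '%s\n' "$html" |
  sed -n '/<div class="aditem-addon">/{n;s/^[[:space:]]*//;s/[[:space:]]*<\/div>.*//;p;}')

echo "$stamp"
```

Because the leading whitespace is matched with [[:space:]]*, the amount of indentation does not matter.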
Unless the input is very simple, consider using xmllint or another HTML parsing tool. For very simple cases, you can use a plain shell solution based on parameter expansion:
#! /bin/sh
s=' <div data-href="/www.somewebspace.com" class="class1 class2"> '
s1=${s##*data-href=\"}
s1=${s1%%\"*}
echo "$s1"
Which will print
/www.somewebspace.com

Remove path prefix of space separated paths

Given a list of paths separated by a single space:
/home/me/src/test /home/me/src/vendor/a /home/me/src/vendor/b
I want to remove the prefix /home/me/src/ so that the result is:
test vendor/a vendor/b
For a single path I would do: ${PATH#/home/me/src/} but how do I apply it to this series?
You can use // to replace all occurrences of a substring. Replace it with the null string to remove them.
$ path="/home/me/src/test /home/me/src/vendor/a /home/me/src/vendor/b"
$ echo ${path//\/home\/me\/src\/}
test vendor/a vendor/b
Reference: ${parameter/pattern/string} in Bash reference manual
Shell parameter expansion is useful here, as nu11p01n73R's answer shows.
For clarity, I would use sed with the syntax sed 's#pattern#replacement#g':
$ str="/home/me/src/test /home/me/src/vendor/a /home/me/src/vendor/b"
$ sed 's#/home/me/src/##g' <<< "$str"
test vendor/a vendor/b
As always, a grep solution from my side:
echo 'your string' | grep -Po '^/([^ /]*/)+\K.+'
Please note that the above regex strips the leading directories of any string like /x/y/z/test ..., and only at the start of the line. If you are only interested in removing /home/me/src/, try the following:
echo 'your string' | grep -Po '^/home/me/src/\K.+' --color
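The question's own single-path idea, ${var#prefix}, can also be applied to each path in turn. The following is a sketch only; strip_prefix is a hypothetical helper name, and it assumes the paths contain no spaces or glob characters, since the list is split by the shell.

```shell
# strip_prefix PREFIX PATH...: remove PREFIX from the start of each PATH
# and print the results separated by single spaces.
strip_prefix() {
    prefix=$1
    shift
    out=''
    for p in "$@"; do
        # ${p#"$prefix"} removes the prefix only at the beginning of $p;
        # the quotes make the prefix match literally.
        out="${out:+$out }${p#"$prefix"}"
    done
    printf '%s\n' "$out"
}

paths="/home/me/src/test /home/me/src/vendor/a /home/me/src/vendor/b"
# $paths is deliberately unquoted so the shell splits it on spaces.
result=$(strip_prefix /home/me/src/ $paths)
echo "$result"
```

Unlike the // replacement, this only touches the beginning of each path, so a path such as /opt/home/me/src/x would be left unchanged.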

Extracting strings from output in Linux console

I have been trying to extract specific strings from the output in Linux
For example:
ps -eo pid,args | grep PRD_ | egrep startscen.sh | more
gives the following output
(Full-size image: http://i.imgur.com/reS7wZ1.png)
I am aware awk, sed, tr can be used to extract details like PID but I am not sure how to write a query to get exactly the pid of the row where the fourth column has a specific string like 'PROCESS_ALL_BETS'
Or how do I extract every character after _NAME=?
Awk to the rescue.
ps -eo pid,args | awk '/PRD_/ && /startscen\.sh/ && $4 ~ /PROCESS_ALLBETS/'
(In the image, you have PROCESS_ALLBETS, so I guess that's what you actually want, even though your text says PROCESS_ALL_BETS.)
This selects for printing every line which matches all the following conditions:
/PRD_/ -- there is a "PRD_" somewhere in the line. Maybe you would tighten this to something like $6 ~ /^-NAME=PRD_/ to only match on the beginning of the sixth field.
/startscen\.sh/ -- there is a match for this regex somewhere on the line. Again, for improved precision, you might want to change this to $3 ~ /startscen\.sh/ or even $3 == "startscen.sh" if you only want exact matches.
$4 ~ /PROCESS_ALLBETS/ -- the fourth field matches this regular expression.
The above will simply print all matching lines. To print just the first field and the eighth field with the prefix -SESSION_NAME= removed, add something like
{ n=$8; sub(/^-SESSION_NAME=/,"",n); print $1, n }
just before the closing single quote.
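Put together, the complete command might look like the sketch below. The real ps output is only available as an image, so the two sample lines are invented purely to have something to run against; their field layout (script name in field 3, -NAME= in field 6, -SESSION_NAME= in field 8) follows the assumptions made in the answer above.

```shell
# Invented sample lines standing in for the output of: ps -eo pid,args
sample='1234 /bin/sh startscen.sh PROCESS_ALLBETS 001 -NAME=PRD_A -V=1 -SESSION_NAME=NIGHTLY
5678 /bin/sh startscen.sh OTHER_SCEN 001 -NAME=PRD_B -V=1 -SESSION_NAME=HOURLY'

# Same conditions as above, plus an action that prints the PID and the
# eighth field with its -SESSION_NAME= prefix removed.
result=$(printf '%s\n' "$sample" |
  awk '/PRD_/ && /startscen\.sh/ && $4 ~ /PROCESS_ALLBETS/ {
         n = $8
         sub(/^-SESSION_NAME=/, "", n)
         print $1, n
       }')

echo "$result"
```

Against real data, replace the printf with ps -eo pid,args and adjust the field numbers if the arguments appear in a different order.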

Extracting part of a string to a variable in bash

noob here, sorry if a repost. I am extracting a string from a file, and end up with a line, something like:
abcdefg:12345:67890:abcde:12345:abcde
Let's say it's in a variable named testString
the length of the values between the colons is not constant, but I want to save the number, as a string is fine, to a variable, between the 2nd and 3rd colons. so in this case I'd end up with my new variable, let's call it extractedNum, being 67890 . I assume I have to use sed but have never used it and trying to get my head around it...
Can anyone help? Cheers
On a side-note, I am using find to extract the entire line from a string, by searching for the 1st string of characters, in this case the abcdefg part.
Pure Bash using an array:
testString="abcdefg:12345:67890:abcde:12345:abcde"
IFS=':' read -r -a array <<< "$testString"   # split on ":" without changing IFS globally or globbing
echo "value = ${array[2]}"
The output:
value = 67890
Here's another pure bash way. Works fine when your input is reasonably consistent and you don't need much flexibility in which section you pick out.
extractedNum="${testString#*:}" # Remove through first :
extractedNum="${extractedNum#*:}" # Remove through second :
extractedNum="${extractedNum%%:*}" # Remove from next : to end of string
You could also filter the file while reading it, in a while loop for example:
while IFS=' ' read -r col line ; do
# col has the column you wanted, line has the whole line
# # #
done < <(sed -e 's/\([^:]*:\)\{2\}\([^:]*\).*/\2 &/' "yourfile")
The sed command skips the first two colon-separated fields, picks out the third, and separates that value from the entire line with a space. If you don't need the entire line, just remove the space+& from the replacement and drop the line variable from the read. You can pick any column by changing the number in the \{2\} bit, which is the number of fields to skip. (Put the command in double quotes if you want to use a variable there.)
You can use cut for this kind of stuff. Here you go:
VAR=$(echo abcdefg:12345:67890:abcde:12345:abcde |cut -d":" -f3); echo $VAR
For the fun of it, this is how I would (not) do this with sed, but I'm sure there's easier ways. I guess that'd be a question of my own to future readers ;)
echo abcdefg:12345:67890:abcde:12345:abcde |sed -e "s/[^:]*:[^:]*:\([^:]*\):.*/\1/"
this should work for you: the key part is awk -F: '$0=$3'
NewVar=$(getTheLineSomehow...|awk -F: '$0=$3')
example:
kent$ newVar=$(echo "abcdefg:12345:67890:abcde:12345:abcde"|awk -F: '$0=$3')
kent$ echo $newVar
67890
if your text was stored in var testString, you could:
kent$ echo $testString
abcdefg:12345:67890:abcde:12345:abcde
kent$ newVar=$(awk -F: '$0=$3' <<<"$testString")
kent$ echo $newVar
67890
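For completeness, the same split can be sketched with read and a here-document, which avoids both an external process and bash-only features. The variable names first, second and rest are placeholders chosen for this example; only extractedNum comes from the question.

```shell
testString="abcdefg:12345:67890:abcde:12345:abcde"

# IFS=: applies only to this read command. The third field lands in
# extractedNum and everything after it goes into rest.
IFS=: read -r first second extractedNum rest <<EOF
$testString
EOF

echo "$extractedNum"
```

Since the last variable receives the remainder of the line, the number of fields after the third does not matter.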

How do I count the number of occurrences of a string in an entire file?

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to count the number of times a certain string (not word) appears in a file. This can include multiple occurrences per line so the count should count every occurrence not just count 1 for lines that have the string 2 or more times.
For example, with this sample file:
blah(*)wasp( *)jkdjs(*)kdfks(l*)ffks(dl
flksj(*)gjkd(*
)jfhk(*)fj (*) ks)(*gfjk(*)
If I am looking to count the occurrences of the string (*) I would expect the count to be 6, i.e. 2 from the first line, 1 from the second line and 3 from the third line. Note how the one across lines 2-3 does not count because there is a LF character separating them.
Update: great responses so far! Can I ask that the script handle the conversion of (*) to \(*\), etc? That way I could just pass any desired string as an input parameter without worrying about what conversion needs to be done to it so it appears in the correct format.
You can use basic tools such as grep and wc:
grep -o '(\*)' input.txt | wc -l
Using perl's "Eskimo kiss" operator with the -n switch to print a total at the end. Use \Q...\E to ignore any meta characters.
perl -lnwe '$a+=()=/\Q(*)/g; }{ print $a;' file.txt
Script:
use strict;
use warnings;
my $count;
my $text = shift;
while (<>) {
$count += () = /\Q$text/g;
}
print "$count\n";
Usage:
perl script.pl "(*)" file.txt
This loops over the lines of the file, and on each line finds all occurrences of the string "(*)". Each time that string is found, $c is incremented. When there are no more lines to loop over, the value of $c is printed.
perl -ne'$c++ while /\(\*\)/g;END{print"$c\n"}' filename.txt
Update: Regarding your comment asking that this be converted into a solution that accepts the search string as an argument, you might do it like this (\Q quotes any metacharacters, so the argument is matched literally):
perl -ne'BEGIN{$re=shift;}$c++ while /\Q$re/g;END{print"$c\n"}' 'regex' filename.txt
That ought to do the trick. If I felt inclined to skim through perlrun again I might see a more elegant solution, but this should work.
You could also eliminate the explicit inner while loop in favor of an implicit one by providing list context to the regexp:
perl -ne'BEGIN{$re=shift}$c+=()=/\Q$re/g;END{print"$c\n"}' 'regex' filename.txt
You can use the basic grep command. Note that -c counts matching lines, not individual occurrences, so a line containing the string twice is counted once.
Example: to count the lines containing the word "hello" in a file:
grep -c "hello" filename
To count the lines matching a pattern:
grep -c -P "Your Pattern" filename
Pattern examples: hell.w, \d+, etc.
I have used the command below to count a particular string in a file (it counts matching lines, so several occurrences on one line count once):
grep search_String fileName|wc -l
text="(\*)"
grep -o "$text" file | wc -l
You can make it into a script which accepts arguments like this:
script count:
#!/bin/bash
text="$1"
file="$2"
grep -o "$text" "$file" | wc -l
Usage:
./count "(\*)" file_path
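The update to the question asked for a way to pass the string as-is, without escaping it by hand. One hedged way to do that with grep is the -F option, which treats the pattern as a fixed string. The sketch below uses a hypothetical helper name, count_literal, and assumes a grep that supports -o (GNU and BSD grep both do).

```shell
# count_literal STRING FILE: count every non-overlapping occurrence of the
# literal STRING in FILE.
count_literal() {
    # -F: fixed string (no regex), -o: one output line per match,
    # --: end of options, so STRING may start with a dash.
    grep -oF -- "$1" "$2" | wc -l
}

# Sample file from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
blah(*)wasp( *)jkdjs(*)kdfks(l*)ffks(dl
flksj(*)gjkd(*
)jfhk(*)fj (*) ks)(*gfjk(*)
EOF

count=$(count_literal '(*)' "$sample")
rm -f "$sample"

# Arithmetic expansion drops any padding that wc may add.
echo $((count))
```

For the sample file this gives the expected 6. Matches never span lines, so the occurrence split across lines 2 and 3 is not counted.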
