Extract string from html files

I have many HTML files. Each file contains the following line:
<img src="<BASE_HTTP_URL>bladf.gif" border="0" alt="" />
I need to extract the HTML file name first, and then the file name after BASE_HTTP_URL. In this case it is bladf.gif, but it can be any file name with any of many extensions.
I have tried to extract the name of the file using this awk:
for f in *.html
do
awk -F'"' '/img src=/{print $4}' $f
done
But I get zero as a result. How can I print the HTML file name and, next to it, the file name that follows BASE_HTTP_URL?
Thanks

awk -F'"' '/img src=/{match($2, "(.*/)(.*)", url); print $2, url[1], url[2]}'
That is, if I understand your need correctly.
Here's the sample output:
alex#rhyme ~ $ echo '<img src="http://some/url/bladf.gif" border="0" alt="" />' | awk -F'"' '/img src=/{match($2, "(.*/)(.*)", url); print $2, url[1], url[2];}'
http://some/url/bladf.gif http://some/url/ bladf.gif
alex#rhyme ~ $ awk --version
GNU Awk 4.0.2
Copyright (C) 1989, 1991-2012 Free Software Foundation.
What is your awk version?
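The three-argument form of match() used above is a GNU awk extension, so the answer to the version question matters. As a rough sketch for other awks, the same split into directory and file name can be done with sub() on two copies of the field. The variable names dir and file are illustrative only, and the sample line is the one from the answer above.

```shell
# Sample line taken from the answer above.
line='<img src="http://some/url/bladf.gif" border="0" alt="" />'

result=$(printf '%s\n' "$line" | awk -F'"' '
/img src=/ {
    dir = $2; file = $2
    sub("[^/]*$", "", dir)    # keep everything up to and including the last slash
    sub(".*/", "", file)      # keep everything after the last slash
    print $2, dir, file
}')
printf '%s\n' "$result"
```

If the value contains no slash at all, as with <BASE_HTTP_URL>bladf.gif, dir ends up empty and file holds the whole value.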

Let's start with this:
$ cat file1.html
foo
<img src="<BASE_HTTP_URL>bladf.gif" border="0" alt="" />
bar
$ cat file2.html
foo
<img src="<BASE_HTTP_URL>whatever.gif" border="0" alt="" />
bar
$ awk -F'"' '/img src=/{print FILENAME, $2}' *.html
file1.html <BASE_HTTP_URL>bladf.gif
file2.html <BASE_HTTP_URL>whatever.gif
or:
$ awk -F'"' 'sub(/<img src="<BASE_HTTP_URL>/,""){print FILENAME, $1}' *.html
file1.html bladf.gif
file2.html whatever.gif
If none of that is what you wanted, update your question to clarify.

Related

how to get values from file using sed,awk or grep on linux command/scripting?

I have file1 with this content:
<action>
<row>
<column name="book" label="book">stick man (2020)/</column>
<column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/</column>
</row>
<row>
<column name="book" label="book">python easy (2019)/</column>
<column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/</column>
</row>
</action>
I want to extract the contents of the file using Linux scripting or commands (sed, grep or awk). Example output:
stick man (2020) | http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30
python easy (2019) | http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30
My code:
grep -oP 'href="([^".]*)">([^</.]*)' file1
Please help, I am a newbie :)

$ awk -v RS='<[^>]+>' 'NF{printf "%s", $0 (++c%2?" |":ORS)}' file
stick man (2020)/ | http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/
python easy (2019)/ | http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/
Note that the trailing forward slashes are in your original data.
This requires multi-character RS support (GNU awk).
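For an awk without multi-character RS, a possible alternative is to split each line on angle brackets instead. This is only a sketch: it assumes every column element sits on its own line, as in the question, and it also trims the leading space and the trailing slash so that the result matches the requested output. The temporary file holds a shortened copy of the sample data.

```shell
# Shortened sample data from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<action>
<row>
<column name="book" label="book">stick man (2020)/</column>
<column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/</column>
</row>
</action>
EOF

result=$(awk -F'[<>]' '
/<column / {
    v = $3                 # text between the opening and closing tag
    sub(/^ +/, "", v)      # trim leading spaces
    sub(/\/$/, "", v)      # drop the trailing slash present in the data
    if (++c % 2) printf "%s | ", v   # first column of a pair
    else print v                     # second column ends the output line
}' "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```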

This
<action>
<row>
<column name="book" label="book">stick man (2020)/</column>
<column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/</column>
</row>
<row>
<column name="book" label="book">python easy (2019)/</column>
<column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/</column>
</row>
</action>
does look like a piece of an HTML file. If you are allowed to install utilities on your system, I suggest trying hxselect, which is useful when you want to extract something you can describe with a CSS selector. For example, to get the content of all columns whose label is referensi from file.html:
cat file.html | hxselect -i -c -s '\n' 'column[label=referensi]'

With awk you can try:
awk -F'>|/<' '{ORS= (NR == 3 || NR == 7) ? " |" : "\n"} $2 != "" {print $2}' file
stick man (2020) | http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30
python easy (2019) | http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30
Or shorter:
awk -F'>|/<' '{ORS= (NR%2) ? " |" : RS} $2 != "" {print $2}' file

Sed - adding tags to the beginning and end of a line while skipping empty lines

I have multiple text files, and I'm trying to add paragraph tags to the beginning and end of each line in them, while skipping the first line and empty lines.
So far I have come up with the code below, but it does not skip empty lines, and it puts the closing tag on a new line.
for i in *.txt; do sed -i -e '1 ! s/.*/<p>&<\/p>/' $i; done
For example, let's say the text file looks like this:
This Is the File Name
Paragraph 1
Paragraph 2
Paragraph 3
Paragraph 4
This is the output I'm getting with my code:
This Is the File Name
<p>
</p>
<p>Paragraph 1
</p>
<p>
</p>
<p>Paragraph 2
</p>
<p>
</p>
<p>Paragraph 3
</p>
<p>
</p>
<p>Paragraph 4</p>
What I'm trying to get is this:
This Is the File Name
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
<p>Paragraph 4</p>

This happens because .* matches empty strings. Simply make it require at least one character with ..*:
$ sed -i -e '1 ! s|..*|<p>&</p>|' file.txt
$ cat file.txt
This Is the File Name
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
<p>Paragraph 4</p>
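The stray closing tags on their own lines in the question's output would be consistent with Windows (CRLF) line endings. That is an assumption, since the question does not say so. If it holds, a "blank" line still contains a carriage return, so ..* would still match it. A minimal sketch that strips carriage returns before applying the same substitution, using a made-up sample file:

```shell
# Hypothetical sample file with CRLF line endings.
tmp=$(mktemp)
printf 'This Is the File Name\r\n\r\nParagraph 1\r\n\r\nParagraph 2\r\n' > "$tmp"

# Strip carriage returns, then wrap every non-empty line except the first.
result=$(tr -d '\r' < "$tmp" | sed '1!s|..*|<p>&</p>|')
rm -f "$tmp"
printf '%s\n' "$result"
```

This writes to standard output rather than editing in place, so the original file is left untouched until the result has been checked.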

Using awk (the first line and empty lines are printed unchanged):
awk 'NR == 1 || !NF { print; next } { print "<p>" $0 "</p>" }' file

How to grep particular lines

I am trying to fetch some IDs from a URL.
In my script I hit the URL in a while loop with wget and save the output to a file.
Then, in the same loop, I grep for XYZ User ID: plus the 3 lines after it, and save that to another file.
When I open this output file I find the following lines:
< p >XYZ User ID:< /p>
< /td >
< td>
< p>2989288174< /p>
So, using grep or anything else, how can I print the following output?
XYZ User ID:2989288174

Supposing a constant tag pattern:
<p>XYZ User ID:</p>
</td>
<td>
<p>2989288174</p>
grep with -P (Perl-compatible regular expressions, as supported by GNU grep) is probably the simplest way:
grep -oP '(?<=p>)([^>]+?)(?=<\/p)' outputfile|while read user;do
read id
echo "$user $id"
done
Note that look-behind expressions cannot be of variable length. That means you cannot use quantifiers such as ?, * or +, or alternation of different-length items, inside them.
For tags with variable spacing, an awk one-liner may be better suited:
awk '/User ID/{print ""}/p *>/{printf "%s", $3}' FS='(p *>|<)' outputfile

This should work (sed with extended regex):
sed -nr 's#<\s*p\s*>([^>]*)<\s*/\s*p\s*>#\1#p' file | tr -d '\n'
Output:
XYZ User ID:2989288174
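Another way to get the same result, sketched here with plain awk and no GNU-specific options, is to delete everything that looks like a tag and join whatever text is left. It assumes the output file contains only the four lines shown in the question; the temporary file stands in for that output file.

```shell
# The four lines from the question, including the odd spacing inside the tags.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
< p >XYZ User ID:< /p>
< /td >
< td>
< p>2989288174< /p>
EOF

result=$(awk '
{
    gsub(/<[^>]*>/, "")            # remove anything that looks like a tag
    gsub(/^[ \t]+|[ \t]+$/, "")    # trim surrounding whitespace
}
NF { printf "%s", $0 }             # join the remaining pieces of text
END { print "" }                   # finish with a newline
' "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```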

how to delete line after specific pattern and extract something

UPDATE
This is my file:
<department name="/fighters" id="123879" group="channel" case="none" use="no">
<options index_name="index.html" listing="0" sum="no" allowed="no" />
<target prefix="ttp" suffix=".net" />
<type="effort">
<region="20491" readonly="fs1a" readwrite="fs1a" upload="yes" download="yes" repl="yes" hard="0" soft"0" prio="0" write="no" stage="yes" migrate="no" size="0" >
<read="content" readwrite="content" hard="215822106624" soft="237296943104" prio="5" write="yes" stage="yes" migrate="no" size="0" />
<overflow name="20491-set-writable" />
</replicate>
<region="20576" readonly="fs1a" readwrite="fs1a" upload="yes" download="yes" repl="yes" hard="0" soft"0" prio="0" write="no" stage="yes" migrate="no" size="0" >
<read="content" readwrite="content" hard="215822106624" soft="237296943104" prio="5" write="yes" stage="yes" migrate="no" size="0" />
<overflow name="20576-set-writable" />
</replicate>
</replication>
<user="T:106603" />
<user="T:123879" />
<user="test" />
<user="ele::123456" />
<user="company-temp" />
<user="companymw2" />
<user="bird" />
<user="coding11" />
<user="plazamedia" />
<allow go="123456=abcdefghijklmnopqrstuvwxyz" />
</department>
I wrote a shell pipeline like this:
awk < test.xml -Fuser= '{ print $2 }' | sed '/^$/d' | cut -d" " -f1
and the result is something like:
"T:106603"
"T:123879"
"test"
"ele::123456"
"company-temp"
"companymw2"
"bird"
"coding11"
"plazamedia"
But imagine the result is:
"T:106603" />
"T:123879" />
"test" />
"ele::123456" />
"company-temp" />
"companymw2" />
"bird" />
"coding11" />
"plazamedia" />
First, how can I remove everything after the second "?
Second, how can I extract everything between the two "?
I would like to do it with sed or awk.
Thank you in advance.

Try this:
awk -F'"' '/<user=/{ print $2 }' file

Using only sed (-n with the p flag prints only the lines where the substitution succeeded):
$ sed -n 's/^<user=\(.*"\).*/\1/p' test.xml # With quotes
$ sed -n 's/^<user="\(.*\)".*/\1/p' test.xml # Without quotes

Try this with cut:
cut -d'"' -f 2 test.xml
Or with sed.
With quotes ("):
sed 's/^.*\("[^"]\+"\).*/\1/g' test.xml
Without quotes ("):
sed 's/^.*"\([^"]\+\)".*/\1/g' test.xml
UPDATE:
sed -e '/^<user/!{d}' -e '/^<user/s/^.*"\([^"]\+\)".*/\1/' test.xml

If you want to get rid of the sed and cut in the pipeline, there are many ways to do that, depending on what the corner cases are. The simplest, to me, would seem to be
awk -F'"' '/<user=/ { print "\"" $2 "\"" }' test.xml
As usual, here's the obligatory reminder: don't parse XML with regex.
Slightly interesting corner cases would be if there can be quoted double quotes in the string (but usually XML would use entities instead) or if the elements can have multiple attributes. If there could be multiple <user=...> elements on a single line, this will quickly become more complex than the proper solution, which is to use XSLT.
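To illustrate the last corner case, here is a rough sketch of what handling several <user=...> elements on one line might look like in awk, using the standard two-argument match() in a loop. The sample line is made up for the example, and values containing escaped double quotes are not handled.

```shell
# Hypothetical line with several user elements on it.
line='<user="T:106603" /><user="test" /> <user="bird" />'

result=$(printf '%s\n' "$line" | awk '
{
    s = $0
    # Repeatedly find the next <user="..." and print the quoted value.
    while (match(s, /<user="[^"]*"/)) {
        print substr(s, RSTART + 6, RLENGTH - 6)   # skip the 6 characters of <user=
        s = substr(s, RSTART + RLENGTH)            # continue after this match
    }
}')
printf '%s\n' "$result"
```

Even this small case already needs a loop and offset arithmetic, which is the growth in complexity the answer warns about.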

Try :
$ awk '/<user=/ && gsub(/<user=|\/>/,x)' file
"T:106603"
"T:123879"
"test"
"ele::123456"
"company-temp"
"companymw2"
"bird"
"coding11"
"plazamedia"
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

Using GNU grep:
grep -Po 'user=\K"[^"]*"' file

Replace a line in a file with a string

I have a file, file1, with the following content:
{"name":"clio5", "value":"13"}
{"name":"citroen_c4", "value":"23"}
{"name":"citroen_c3", "value":"12"}
{"name":"golf4", "value":"16"}
{"name":"golf3", "value":"8"}
I want to find the line that contains the word clio5 and replace that line with the following string:
string='{"name":"clio5", "value":"1568688554"}'

$ string='{"name":"clio5", "value":"1568688554"}'
$ awk -F'"(:|, *)"' -v string="$string" 'BEGIN{split(string,s)} {print ($2==s[2]?string:$0)}' file
{"name":"clio5", "value":"1568688554"}
{"name":"citroen_c4", "value":"23"}
{"name":"citroen_c3", "value":"12"}
{"name":"golf4", "value":"16"}
{"name":"golf3", "value":"8"}
$ string='{"name":"citroen_c3", "value":"1568688554"}'
$ awk -F'"(:|, *)"' -v string="$string" 'BEGIN{split(string,s)} {print ($2==s[2]?string:$0)}' file
{"name":"clio5", "value":"13"}
{"name":"citroen_c4", "value":"23"}
{"name":"citroen_c3", "value":"1568688554"}
{"name":"golf4", "value":"16"}
{"name":"golf3", "value":"8"}
Updated the above based on @dogbane's comment, so it will work even if the text contains " characters. It will still fail if the text can contain ":" (with appropriate escapes), but that seems highly unlikely, and the OP can tell us if it's a valid concern.
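The awk commands above print the modified content to standard output and leave the file itself unchanged. One common way to update the file, sketched here with temporary files standing in for file1, is to write to a temporary file and move it over the original only if awk succeeded. The shortened two-line sample is for illustration only.

```shell
# Shortened sample of file1, in a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
{"name":"clio5", "value":"13"}
{"name":"golf4", "value":"16"}
EOF
string='{"name":"clio5", "value":"1568688554"}'

# Write the result to a second temporary file, then replace the original.
tmp=$(mktemp)
awk -F'"(:|, *)"' -v string="$string" '
BEGIN { split(string, s) }                 # s[2] is the name inside $string
{ print ($2 == s[2] ? string : $0) }       # swap in $string when the names match
' "$file" > "$tmp" && mv "$tmp" "$file"

result=$(cat "$file")
rm -f "$file"
printf '%s\n' "$result"
```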

First, extract the name part from $string:
NAME=$(printf '%s\n' "$string" | sed 's/[^:]*:"\([^"]*\).*/\1/')
Then use $NAME to replace the line. This assumes $string contains no / or & characters, which are special in the sed replacement:
sed -i "/\<$NAME\>/s/.*/$string/" file1

Use awk like this (comparing the name fields with == rather than a regex match, so that only an exact name matches):
awk -v str="$string" -F '[,{}:]+' '{
split(str, a);
if (a[3] == $3)
print str;
else print
}' file.json
