extract a specific word between two values - string

I curl an HTML page and store the output in a variable, then try to extract a word between two values, but I'm failing. The content looks like this:
</tr> <tr> <td><a AAA</td>
<td>Thu Aug 30 09:59:36 UTC 2018</td> <td align="right"> 2247366 </td>
<td></td> </tr> <tr> <td>1.1.22</td> <td>Thu Aug 30 09:59:36
UTC 2018</td> <td align="right"> 5 </td> <td></td> </tr> </table>
</body> </html>
content=$(curl -s https://test/one/)
echo $content | sed -E 's_.*one/([^"]+).*_\1_'
I'm trying to capture the value after one/ and before the closing ", so I want to extract AAA, 1.1.22, ...

$ ... | sed -E 's_.*one/([^"]+).*_\1_'
AAA
BBB
Since you have slashes in your content, it is better to choose a different delimiter for the s command; here I used _.
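For comparison, a hedged sketch of the same substitution written with the default / delimiter, where the slash in one/ then has to be escaped:
$ echo "$content" | sed -E 's/.*one\/([^"]+).*/\1/'
It produces the same output; the alternative delimiter just avoids the extra backslash.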
UPDATE
Since you changed the input format dramatically, here is the updated script:
$ echo "$contents" | sed -nE '/one/s_.*one/([^"]+).*_\1_p'
AAA
1.1.22
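If your grep has PCRE support (GNU grep's -P), a hedged alternative is to let \K discard everything up to and including one/ and print only the remainder before the quote:
$ echo "$contents" | grep -oP 'one/\K[^"]+'
Note that -o prints every match on a line, whereas the sed above keeps only the first.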

Don't parse XML/HTML with regex; use a proper XML/HTML parser and a powerful XPath query.
theory :
According to compiler theory, XML/HTML can't be parsed with regexes, which are based on finite state machines. Because of the hierarchical structure of XML/HTML, you need a pushdown automaton and an LALR grammar manipulated with a tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint, often installed by default with libxml2, xpath1 (check my wrapper to get newline-delimited output)
xmlstarlet, can edit, select, transform... not installed by default, xpath1
xpath, installed via Perl's XML::XPath module, xpath1
xidel, xpath3
saxon-lint, my own project, a wrapper over Michael Kay's Saxon-HE Java library, xpath3
or you can use high level languages and proper libs, I think of :
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
ruby nokogiri, check this example
php DOMXpath, check this example
Check: Using regular expressions with HTML tags
Example using xpath :
//a[contains(@href, "https://test/sites/two/one")]
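For instance, a hedged sketch with xmllint (assuming the page really lives at https://test/one/ and the links carry that prefix in their href attributes):
curl -s https://test/one/ |
  xmllint --html --xpath '//a[contains(@href, "https://test/sites/two/one")]/@href' - 2>/dev/null
This prints the matching href attributes; reducing them to AAA, 1.1.22, ... would still take a final substring-after() or cut.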

Related

How to specify and extract html element by curl

When I curl some pages, e.g.
curl http://test.com
I get a result like the following:
<html>
<body>
<div>
<dl>
<dd> 10 times </dd>
</dl>
</div>
</body>
</html>
My desired result is simply 10 times.
Is there a good way to achieve this? If someone has an opinion, please let me know. Thanks.
If you are unable to use an HTML parser for whatever reason, then for your given simple HTML example you could use:
curl http://test.com | sed -rn 's#(^.*<dd>)(.*)(</dd>)#\2#p'
Redirect the output of the curl command into sed and enable extended regular expressions with -r (or -E). The expression splits the line into three sections and substitutes the whole line with the second section only, printing the result.
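If a parser is available after all, a hedged alternative sketch with xmllint (from libxml2) avoids the regex entirely; normalize-space() also trims the padding inside <dd>:
curl -s http://test.com | xmllint --html --xpath 'normalize-space(//dd)' - 2>/dev/null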

Parsing HTML table in Bash using sed

In bash I am trying to parse the following file:
Input:
</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">
Wanted output:
12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples~withstuff"
11/07/2011 67100270 "https://stuff.com/findstones"
I got to the point that I have:
# less input.txt | sed -e "s/><tr><td//" -e "s/\///" -e "s/a>//" -e "s/<\/td><\/tr>//g" -e "s/<\/td><td>//g" -e "s/>$//g" -e "s/<a class=\"btn-down\" download href=//g"
<stuff.txt (15.18 KB)12/01/2015Large things158520312"https://resource.com/stones"
<flowers.pdf (83.03 MB)23/03/2011Large flowers872448000"https://resource.com/flosers with stuff"
<apples.pdf (281.16 MB)21/04/2012Large things like apples299009564"https://resource.com/apples"
<stones.pdf (634.99 MB)11/07/2011Large stones from mountains67100270"https://stuff.com/findstones"
Is there an easier way to parse it? I feel that it can be done much more simply, and I am not even halfway through the parsing.
Could you please try the following and let us know if this helps you.
awk -F"[><]" '{sub(/.*=/,"",$28);print $15,$23,$28}' Input_file
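A rough reading of that one-liner, assuming the field numbers that fall out of the sample above:
# -F"[><]" splits each line on every < and > character, so for this sample the
# date, the size and the <a ...> tag land in $15, $23 and $28; sub(/.*=/,"",$28)
# strips everything up to the last = in the tag, leaving only the quoted URL.
awk -F"[><]" '{sub(/.*=/,"",$28); print $15, $23, $28}' Input_file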
I'm sure the best way to solve your problem is to use an HTML parser. That said, a solution for the shown sample of the file:
sed -r 's/.*(..\/..\/....).*>([0-9]*)<\/.*href=([^>]*)>/\1 \2 \3/I' input.txt
Personally, I'd use perl, but that's not what you asked, so...
A pedantic stepwise approach, so that you can edit bits of the logic when needed.
Assuming the input is a file named x:
</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">
Try this:
sed -E '
s/>$//;
s/href=/>/;
s/(<[^>]+>)+/~/g;
s/~[^~]+~//;
s/~[^~]+~/ /;
s/~/ /;
' x
Output:
12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples"
11/07/2011 67100270 "https://stuff.com/findstones"
Explained:
sed -E '
This uses extended regexes and opens a script of sed code so that I can list each pattern individually. Each will be executed in order on each line, so it's not super efficient, but it's "readable" as regex code goes, reasonably maintainable once you understand it, and easy to edit when something needs tweaking.
s/>$//;
Strip the closing > off the end, to preserve the URL before squashing out all the other tags.
s/href=/>/;
Use the href= as a hook to insert the > back, so we can squash out all the tags in one pass.
s/(<[^>]+>)+/~/g;
Convert ALL the strings of tags and everything still in them to a simple delimiter each.
s/~[^~]+~//;
Eliminate the leading and second delimiter and the first unneeded field between them.
s/~[^~]+~/ /;
Eliminate the third and fourth delimiters and the unneeded third field between them, replacing them with the space you wanted in the output.
Those two are very similar, and could certainly be combined with minimal shenanigans, but I left them nigh-redundant for easier explication.
s/~/ /;
Convert the remaining delimiter to the other space you wanted between the remaining fields.
' x
Close the script and give it the filename to read.
Obviously, this leaves a LOT of room for improvement, and is in many ways stylistically repulsive, but hopefully it is a simple explanation of tricks you can hack into a maintainably useful solution to your issue.
Good luck.
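As a hedged variation on that theme, the two space-inserting substitutions at the end can be folded into one global alternation, which produces the same output for this sample:
sed -E '
s/>$//;
s/href=/>/;
s/(<[^>]+>)+/~/g;
s/~[^~]+~//;
s/~[^~]+~|~/ /g;
' x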

Bash script to copy from one line and replace another line with the copy

I am looking to write a bash script for something slightly more complicated than the usual find/replace via sed. I have a book called bash Cookbook that I have been trying to glean some inspiration from but I am not getting very far.
Basically I am trying to write a script to update the version numbers in a bunch of maven pom.xml files automatically. Here is the general setup I am looking at:
<!-- TEMPLATE:BEGIN
<version>##VERSION##</version>
-->
<version>1.0.0</version>
<!-- TEMPLATE:END -->
After running the script (with the new version number 1.0.1) I'd like the file to read this instead:
<!-- TEMPLATE:BEGIN
<version>##VERSION##</version>
-->
<version>1.0.1</version>
<!-- TEMPLATE:END -->
So this would be in the actual release pom file, with 1.0.0 being the current version (and I am trying to replace it with 1.0.1 or something). Obviously the version number will be changing so there isn't a good way to do a find/replace (since the thing you want to find is variable). I am hoping to be able to write a bash script which can
replace ##VERSION## with the actual version number
delete the current version line
write the updated version line on the line before the TEMPLATE:END (while preserving the ##VERSION## in the file - possibly do this by writing template out to a temp file, doing replacement, then back in?)
I can sort of do some of this (writing out to a new file, doing the replacement) using an Ant script à la
<replace file="pom.xml">
    <replacefilter
        token="##VERSION##"
        value="${version}"/>
</replace>
But I am not sure of the best way to a) delete the line with the old version or b) copy the new line into the correct place. Does anyone know how to do this, or have any advice?
Assuming the new version number is in a shell variable $VERSION, then you should be able to use:
sed -e '/<!-- TEMPLATE:BEGIN/,/<!-- TEMPLATE:END -->/{
s/<version>[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*<\/version>/<version>'$VERSION'<\/version>/
}'
Note that this ignores the template version line with ##VERSION## and only matches a three-part version number that appears between the lines containing TEMPLATE:BEGIN and TEMPLATE:END, leaving everything else (including other lines containing a <version>...</version> element) alone.
You can decide how to do file overwriting (maybe your version of sed is from GNU and it does that automatically on request with the -i option), etc. You might also be able to use more powerful regular expression notations that lead to more compact matches. However, that should work on most versions of sed without change.
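A hedged usage sketch, assuming GNU sed and a pom.xml in the current directory (-i.bak keeps a backup of the original file):
VERSION=1.0.1
sed -i.bak -e '/<!-- TEMPLATE:BEGIN/,/<!-- TEMPLATE:END -->/{
s/<version>[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*<\/version>/<version>'"$VERSION"'<\/version>/
}' pom.xml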
The steps you outlined (1-3) read as if you do not actually care to perform the replacement in accordance with the template rules defined within the comments.
As such, here is some code that does literally what you outlined:
#!/bin/bash
# Usage: ./yourscript.sh <pom file> <new version>
file=$1
newversion=$2
# Replace the active <version> line in place; [^#]* deliberately skips the
# ##VERSION## template line because of the # characters.
sed -i -e "s|<version>\([^#]*\)</version>|<version>$newversion</version>|" "$file"
Run it:
chmod +x yourscript.sh
./yourscript.sh filetoupdate.xml 1.0.1
Another option is Perl with XML::LibXML and Perl::Version, bumping every <version> element it finds:
use 5.010;
use strictures;
use Perl::Version qw();
use XML::LibXML qw();
my $dom = XML::LibXML->load_xml(location => 'pox.xml');
for my $node ($dom->findnodes('//version')) {
    my $version = Perl::Version->new($node->textContent);
    $version->inc_subversion;
    $version->stringify;
    $node->removeChildNodes;
    $node->appendText($version);
};
say $dom->toString;

Trying to 'grep' links from downloaded html pages in bash shell environment without cut, sed, tr commands (only e/grep)

In a Linux shell, I am trying to return links to JPG files from a downloaded HTML file. So far I have only got to this point:
grep 'http://[:print:]*.jpg' 'www_page.html'
I don't want to use auxiliary commands like 'tr', 'cut', 'sed', etc. ('lynx' is okay!).
Using grep alone without massaging the file is doable, but not recommended, as many have pointed out in the comments.
If you can loosen up your requirements a bit, you can use HTML Tidy to massage the downloaded HTML file so that each HTML element ends up on its own line; the regular expression can then stay as simple as you wanted, something like this:
$ tidy file.html|grep -o 'http://[[:print:]]*.jpg'
Note the use of the -o option to grep, which prints only the matching part of the input.
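A hedged refinement, assuming GNU or BSD grep: escape the dot so that the pattern requires a literal .jpg extension, and silence tidy's diagnostics with -q:
tidy -q file.html 2>/dev/null | grep -o 'http://[[:print:]]*\.jpg'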

How do I type this TextMate Keyboard command?

I have a Ruby on Rails (RoR) document open and want to insert the <%= %> pair of brackets. In TextMate, it's under Bundles > Ruby > Insert ERB's and the key command looks like:
^ >
How do I type that on a Mac? Shift+Ctrl+> doesn't work.
The TextMate document must be set to HTML (Rails), not Ruby on Rails.
