How do I get at image title attributes in Watir?

Given the following HTML code (which, I realise, sucks, but that's not something I can currently solve):
<img height="64" width="64" class='list_item' src="/img/icon/first.jpg"
title="This is the first item::Completed the item "I did this first"" alt="First" />
gives me a result of (this is an image.to_s)
name:
type:
id:
value:
disabled:
src: /img/icon/first.jpg
width: 64
height: 64
alt: First
Note the lack of a "title" attribute in that output. This (the missing title) does not actually change.
If I get the contents of the parent div of one of those icons, I get something like:
<img class="list_item" I="" did="" this="" first="" src="/img/icon/first.jpg" alt="First">
The broken HTML of the original has been turned into separate attributes somewhere down the line, but the title attribute appears to have been stripped completely, and since it's the contents of the title attribute I need, I'm a little stuck.
This has been tried with the latest Watir on Ruby 1.9.2 using Firefox.
Perfect world solution: I'd like to get the original transmitted HTML for the image tag, so I can "special case" (i.e., hack) around the stupid double-quote problem.
Good Enough Solution: the contents of the title attribute.

There is actually a #title method on Watir::Image. With the above incorrect HTML the output would be like this (where 'i' is the Image object):
i.title
=> "This is the first item::Completed the item "
This shows only part of the title.
But you could use #html and then parse all the necessary information out of it with some magic:
i.html
=> "<IMG class=list_item title=\"This is the first item::Completed the item \" alt=First src=\"/img/icon/first.jpg\" width=64 height=64 first?? this did I>"
But as other answers have mentioned, you cannot get the full title out correctly because of the bad HTML. Maybe there's some other way to accomplish your bigger goal?
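As a rough illustration of the "parse it out of the raw markup" idea, here is a minimal Python sketch. It assumes you already hold the original transmitted HTML as a string (for example from a separate HTTP fetch, not from the browser's repaired DOM), and it assumes the title attribute is immediately followed by the alt attribute, as in the question's markup. The names RAW_IMG, TITLE_RE and extract_title are made up for this example.

```python
import re

# Raw markup as transmitted by the server (copied from the question, joined onto one line).
RAW_IMG = (
    '<img height="64" width="64" class=\'list_item\' src="/img/icon/first.jpg" '
    'title="This is the first item::Completed the item "I did this first"" alt="First" />'
)

# Assumption: title is directly followed by alt, so the title value ends at the
# last double quote before whitespace and "alt=". The lazy match keeps the
# unescaped inner quotes inside the captured group.
TITLE_RE = re.compile(r'title="(.*?)"\s+alt=', re.DOTALL)


def extract_title(raw_tag):
    """Return the full title text from a raw img tag, or None if not found."""
    match = TITLE_RE.search(raw_tag)
    return match.group(1) if match else None


print(extract_title(RAW_IMG))
```

This only works as long as the attribute order in the served HTML stays the same; if the order differs, the anchor after the title would have to change.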

Getting the title probably isn't working because the way the title attribute is set on that element isn't valid. The characters ", < and > need to be escaped inside HTML attributes as &quot;, &lt; and &gt; respectively. Escape the quotes and try again.
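If the page is generated by code you control, the escaping can be done when the attribute is written. A minimal Python sketch using the standard library's html.escape follows; the helper name title_attribute is hypothetical and not part of any framework.

```python
from html import escape


def title_attribute(text):
    """Build a title="..." attribute with quotes and angle brackets escaped."""
    # quote=True also escapes double and single quotes, not only &, < and >.
    return 'title="{}"'.format(escape(text, quote=True))


print(title_attribute('Completed the item "I did this first"'))
```

With the quotes escaped this way, the browser keeps the whole title value as a single attribute.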

Not sure, but I don't think Watir supports image titles. I looked over the Supported Elements page, title was x'ed out. I don't see it in the RDoc for Watir::Image type either.

Related

How to filter text after webscraping

So I'm trying to webscrape this website that provides novels for free, for example this page: https://www.wuxiaworld.com/novel/martial-world/mw-chapter-1
I'm trying to extract only the title and the body of the chapter. Finding the title is easy enough since it's in an h4; however, the body of the chapter is not wrapped in any specific div tag, so I cannot just isolate it. I was wondering how I'd do this. The closest I've gotten to just having the text is this.
P.S. I'm new to webscraping, sorry if my question is unclear or stupid.
I tried to identify whether the body of text was under any exclusive div tag, but it wasn't, so I tried to select it via the closest div tag. This still returned a lot of useless and unwanted text.
Edit: @koro, there's more than one instance of fr-view being used, so it doesn't isolate the text. The fr-view class also appears before the chapter text.
I'm not versed in webscraping but upon reviewing the page source html I see that <div class="fr-view"> only precedes the body text on the novel pages. If you start the logging after the scraper identifies this line you should be able to stop at the very next <a href="/novel..... tag to only have the novel text included.
Some of the pages I see also include footnotes with some extra information, these include an <a href=#footnote....> tag, so if you would like to keep the footnotes included I would search for <a href=/novel...> and NOT <a href=...>
P.S. I only looked at 4 pages, and while they all appear to have the format I've pointed out above, it's still possible that you may run into issues, but that's definitely a bridge you can cross when you get there!
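The start/stop idea described above could be sketched in Python roughly as follows, using only the standard library's HTMLParser. It assumes the chapter text starts at the first div with class fr-view and ends at the first link whose href begins with /novel; footnote links (href starting with #) are kept. As the question's edit points out, fr-view can occur more than once, so the start condition may need tightening on real pages. The class and function names are invented for this example.

```python
from html.parser import HTMLParser


class ChapterTextParser(HTMLParser):
    """Collect text between <div class="fr-view"> and the next /novel link."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href") or ""
        if tag == "div" and "fr-view" in classes:
            self.capturing = True  # start marker from the answer above
        elif self.capturing and tag == "a" and href.startswith("/novel"):
            self.capturing = False  # stop at the first navigation link
            self.done = True

    def handle_data(self, data):
        text = data.strip()
        if self.capturing and text:
            self.chunks.append(text)


def extract_chapter(html):
    """Return the captured chapter text, one text node per line."""
    parser = ChapterTextParser()
    parser.feed(html)
    return "\n".join(parser.chunks)


sample = (
    '<h4>Chapter 1</h4><div class="fr-view"><p>Line one.</p>'
    '<p>Note<a href="#footnote-1">1</a></p></div>'
    '<a href="/novel/martial-world/mw-chapter-2">Next</a><p>junk</p>'
)
print(extract_chapter(sample))
```

The sample string is a made-up stand-in for a downloaded page, not the site's actual markup.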

Kentico 9 transformation page/file URL

In my custom page type, you can select an uploaded file. That's fine, but in my ascx transformation, I'm having a hard time getting the URL. The field is 'Process'.
Here's what i currently have.
<%# IfEmpty(Eval("Process"),"N/A","<a href=" + Eval("Process") +" target='blank' class='icon download'>Download</a>")%>
When rendered, the html is this:
Download
I'm missing something.
You can use either of the two methods below. Both have their drawbacks, though.
<a href="<%# GetFileUrl("Process", "Something") %>">Link here</a>
The drawback with this is that if there is no value in the "Process" field, it will return an invalid URL. So I tend to use something a little better (but not much):
Item to download
This will create a valid URL with some invalid properties to it. Meaning if there is no value in the Process field, it will return 00000000-0000-0000-0000-000000000000. If the NodeAlias field is empty, it will return "download". So again, not 100% fool-proof but it works well in most cases.
Update
Check out this link:
https://devnet.kentico.com/articles/options-for-file-fields-in-structured-data
I think the piece you need in here is in the "CMS.File page type" section:
This is the link to the picture
Check out transformation methods reference
You can use <%#GetImage(Eval("Process"))%>. This will return an image tag. There are a couple of other parameters for sizing if you want to use those.
See the "Transformation reference" link in your Transformation editor; it lists all the available transformation methods you can use.
In it it shows:
This will generate an actual image tag. If however you want a link, it usually is
/getattachment/<%# Eval("TheImage")%>/ImageFileNameCanBeAnythingThough.jpg
example:
/getattachment/1936c69d-a28c-428a-90a7-09154918da0f/Christmas.jpg

How to handle PDF pagination in PhantomJS

I am using PhantomJS to create PDFs from html.
It works fine, but I can't find out how to work with pagination; I want to create a page for each div in my document, but I can't find anything in the doc. about pagination.
If my document is short, it makes only one page, and if it is bigger, it creates one second empty page and my contents are in the first page which becomes very long.
Any idea ? (I am using phantomJS-node module for nodeJS)
PhantomJS relies on WebKit's CSS implementation. To implement manual page breaks you can use these properties:
page-break-before : auto/always/avoid/...
page-break-inside : auto/avoid
page-break-after : auto/always/avoid/...
For example, a div can be :
<div style="page-break-before:always;"><!-- content --></div>
or
<div style="page-break-after:always;"> <!-- content --></div>
Controlling page breaks when printing in Webkit is sometimes not easy, in particular with long html tables.
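To get one PDF page per div, the markup handed to PhantomJS can be generated with a break after every section. The following is a minimal Python sketch of that idea, assuming the document is built from a list of plain-text sections; the function name build_paged_html is hypothetical, and the break is left off the last div on the assumption that this avoids a trailing empty page.

```python
from html import escape


def build_paged_html(sections):
    """Wrap each section in its own div and force a page break after it."""
    pages = []
    last = len(sections) - 1
    for index, text in enumerate(sections):
        # No break after the final div (assumed to prevent an empty last page).
        style = "" if index == last else ' style="page-break-after:always;"'
        pages.append("<div{}>{}</div>".format(style, escape(text)))
    return "<html><body>\n{}\n</body></html>".format("\n".join(pages))


print(build_paged_html(["First page", "Second page", "Third page"]))
```

The resulting string can be saved to a file and rendered to PDF in the usual way.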
Very late, but I had issues with "break-inside:avoid" using JsReport that were fixed by changing the element's display type to inline-block. More info here:
https://github.com/ariya/phantomjs/issues/10638
You should see this issue with different tips.
Try using display:inline-block on the element that you don't want to be split by a page break. The reasoning is that WebKit already tries to keep images from breaking across pages, and images are inline-blocks.
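Pagination works fine with:
var webPage = require('webpage');
var page = webPage.create();
// Setting paperSize makes PhantomJS split the rendered output into pages
page.paperSize = {
  format: 'A4',
  orientation: 'portrait',
  margin: '1cm'
};
Check the documentation here: http://phantomjs.org/api/webpage/property/paper-size.html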

Overlapping HTML Content In WebView, when using: webview.getSettings().setLayoutAlgorithm(LayoutAlgorithm.SINGLE_COLUMN);

So I load some simple HTML into a webview.
The HTML is in the following format (it is extracted HTML, therefore no body tags):
<p>..blah</p>
<div style="text-align:center;">
<a href="http://services.runescape.com/m=rswikiimages/en/2012/5/combat1-18140712.jpg" target="_blank">
<img src="http://services.runescape.com/m=rswikiimages/en/2012/5/combat1_thumb-18140804.jpg"></a></div>
<p>..blah</p>
In the webview activity class I add:
webInfo = "<body style=\"color:white;font-size:15px\">" + webInfo + "</body>";
//for setting the size and color of the text
I use the following to load the html:
webview.loadDataWithBaseURL(null,webInfo,"text/html","UTF-8",null);
// webInfo is an html string
and I do the following so that it doesn't scroll horizontally and so that the image fits within the screen.
webview.getSettings().setLayoutAlgorithm(LayoutAlgorithm.SINGLE_COLUMN);
If I don't use SINGLE_COLUMN, I'm always scrolling horizontally, even though the text fits nicely within the page. I thought it had finally solved the problem, but I get overlapping HTML when I load it. This fixes itself if I zoom slightly in or out with my fingers.
Here's a picture of what's happening:
Question 1:
How do I fix that problem and display the image so it fits in the screen and no horizontal scrolling is required?
Question 2:
I've also been having trouble getting rid of the weird symbols; my 's show up really weird. I tried using several different encodings like UTF-8, US-ASCII and lots of Windows ones, but none of them seem to get rid of the symbols.
I hope I've been clear, any help is appreciated, I've been fiddling with this for a while O_O...
Add this to the WebView settings:
webSettings.setLayoutAlgorithm(LayoutAlgorithm.TEXT_AUTOSIZING); // handles overlapping text

How to get the effect of 'user-select: text' in css3?

The validator at http://jigsaw.w3.org/css-validator/ says that the value 'text' for 'user-select' is not valid. For a css rule with this code in it:
user-select: text;
the validator says:
text is not a user-select value : text text
Presumably this is because of this behavior, specified at (the outdated) http://www.w3.org/TR/2000/WD-css3-userint-20000216#user-select:
This property is not inherited, but it does affect children in the
same way that display: none does, it limits it. That is if an element
is user-select: none, it doesn't matter what the user-select value is
of its children, the element's contents or it's childrens contents
cannot be selected.
Also, I only see the attribute value 'text' specified in that out-of-date css3 doc from w3.org:
http://www.w3.org/TR/2000/WD-css3-userint-20000216#user-select
and not in the latest one: http://www.w3.org/TR/css3-ui/
Additionally, searching 'whatwg.org' yields nothing.
Any ideas if 'user-select: text' is valid css3, and if not, what should be used instead?
This would be used, for example, when overriding 'user-select: none' rules applied to containers of text and ancestor containers.
You are getting this wrong. user-select:text doesn't mean it would select text only. It's the default value of the user-select property. The W3C describes it this way:
The element's contents follow a standard text content selection model.
And MDN says something similar:
-moz-none The text of the element and sub-elements cannot be selected,
but selection can be enabled on sub-elements using
-moz-user-select:text .
So I don't think this should prevent selecting images or boxes.
As far as I know user-select:text is useful when you have user-select:none for most or all of your elements and you have a textbox or text area that is kind of output and you want it be selectable for copying and pasting.
It seems if you use -webkit- prefix it works for me. I'm sure it works with -moz- prefix too. Test this fiddle in your browser. I don't know why user-select:text is not working on my Chrome 13 Mac?
