I have the following XML:
<topic class="Top">
<title>
Interesting Article
</title>
<subtitle>
Science & Industry
<insertedText action="start"/>
Inside & Out
<insertedText action="end"/>
A Profile
</subtitle>
</topic>
I would like to use XPath to extract the text of the subtitle, except for the string between the two <insertedText> nodes, giving the text "Science & Industry A Profile".
Here is my latest attempt. To be honest, I'm stumped, and I realise that it does not exclude the text between the two tags. Any help would be appreciated:
/topic[@class='Top']/*[local-name()='subtitle'][not(descendant::insertedText)]/text()
The number of <insertedText> tags is also variable, so there may be none, or multiple sets of <insertedText> tags, all of which should be ignored.
A further example of the type of XML that might be encountered is below:
<topic class="Top">
<title>
Interesting Article
</title>
<subtitle>
Science & Industry
<insertedText action="start"/>
Inside & Out
<insertedText action="end"/>
A Profile
<insertedText action="start"/>
An Insiders View
<insertedText action="end"/>
The Full Story
</subtitle>
</topic>
Answer
The full answer, based on the one provided by @lambo477, is as follows (note that the class value must match the case used in the XML, 'Top'):
./topic[@class='Top']/*[local-name()='subtitle']/text()[1]|/topic[@class='Top']/*[local-name()='subtitle']/*[local-name()='insertedText'][@action='end']/following-sibling::node()[1]
You could try the following XPath:
//subtitle/text()[1]|//insertedText[@action="end"]/following-sibling::node()[1]
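For readers without a full XPath engine, the same idea can be sketched with Python's standard-library ElementTree, which does not support following-sibling but exposes the text before the first child as .text and the text after each child as .tail. This is only an illustration under assumptions: the function name is made up here, and the ampersands are written as &amp; because a raw & is not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Sample from the question, with "&" escaped so the document is well-formed.
XML = """<topic class="Top">
<title>Interesting Article</title>
<subtitle>
Science &amp; Industry
<insertedText action="start"/>
Inside &amp; Out
<insertedText action="end"/>
A Profile
<insertedText action="start"/>
An Insiders View
<insertedText action="end"/>
The Full Story
</subtitle>
</topic>"""


def subtitle_without_inserted(xml_text):
    """Return the subtitle text, skipping anything between start/end markers."""
    root = ET.fromstring(xml_text)
    subtitle = root.find("subtitle")
    # Text before the first child element (the counterpart of text()[1]).
    parts = [subtitle.text or ""]
    # Text that directly follows each "end" marker
    # (the counterpart of following-sibling::node()[1]).
    for marker in subtitle.findall("insertedText[@action='end']"):
        parts.append(marker.tail or "")
    return " ".join(p.strip() for p in parts if p.strip())


result = subtitle_without_inserted(XML)
print(result)
```

Because the loop simply finds no markers when there are none, the same function also handles a subtitle with no <insertedText> tags at all.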
Related
Hi
How can I add a default image for my site to display in the Google search results when I type the name of the site to search for?
Currently a different image is displayed every few weeks; the current one is a picture from my news section.
Google Search result =
https://www.may_site.com [ logo ]
Currently I have added a logo to the first link at the top of the page:
<img src="/images/logo.png" width="70" height="70" alt="Sitename Logo">
Sometimes no picture is visible :/
Google reads meta tags in your page <head></head> to understand what you want to show in search results.
As far as I know there is no official "logo" or "image" meta tag that Google supports, but it is smart enough to understand that "og:image" is the page image.
<head>
<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg">
</head>
You can read more here: https://developers.google.com/search/docs/advanced/crawling/special-tags
You can also find a comprehensive list of meta tags here:
https://gist.github.com/lancejpollard/1978404
Google supports defining your logo in structured data. Here is their documentation on it:
https://developers.google.com/search/docs/advanced/structured-data/logo
However, I think you are talking more about a general image related to a page.
Using the og:image meta tag mentioned by @supermod can be a hint. Google also understands images in certain structured data types like recipes, products, articles etc. Their gallery shows what structured data can cause rich snippets like an image:
https://developers.google.com/search/docs/advanced/structured-data/search-gallery
But it is not necessary to provide metadata or structured data to get images in the search results. Sometimes Google just picks one from the page.
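As a rough illustration of the logo structured data mentioned above, here is a small Python sketch that builds a JSON-LD block of type Organization with url and logo properties. The function name and the example.com URLs are placeholders, not anything from the original posts, and the exact requirements should be checked against Google's logo documentation linked above.

```python
import json


def organization_logo_jsonld(site_url, logo_url):
    """Build a <script> block carrying schema.org Organization markup."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "url": site_url,
        "logo": logo_url,
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Placeholder URLs; substitute your own site and logo addresses.
snippet = organization_logo_jsonld(
    "https://www.example.com",
    "https://www.example.com/images/logo.png",
)
print(snippet)
```

The printed block would be pasted into the page markup. As noted above, Google may still choose a different image for a given result.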
So I'm trying to webscrape this website that provides novels for free, for example this page: https://www.wuxiaworld.com/novel/martial-world/mw-chapter-1
I'm trying to extract only the title and the body of the chapter. Finding the title is easy enough since it's in an h4, but the body of the chapter is not wrapped in any specific div tag, so I cannot just isolate it. I was wondering how I'd do this. The closest I've gotten to having just the text is this.
P.S. I'm new to webscraping, so sorry if my question is unclear or stupid.
I tried to identify whether the body text was under any exclusive div tag, but it wasn't, so I tried to select it via the closest div tag. This still returned a lot of useless and unwanted text.
Edit: @koro, there's more than one instance of fr-view being used, so it doesn't isolate the text. The fr-view class also appears before the chapter text.
I'm not versed in webscraping but upon reviewing the page source html I see that <div class="fr-view"> only precedes the body text on the novel pages. If you start the logging after the scraper identifies this line you should be able to stop at the very next <a href="/novel..... tag to only have the novel text included.
Some of the pages I see also include footnotes with some extra information, these include an <a href=#footnote....> tag, so if you would like to keep the footnotes included I would search for <a href=/novel...> and NOT <a href=...>
P.S. I only looked at 4 pages, and while they all appear to have the format I've described above, it's still possible that you will run into issues. That's a bridge you can cross when you get there!
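The start/stop idea described in this answer could be sketched with Python's standard-library html.parser as follows. This is a minimal sketch, not a tested scraper for the real site: the class name is invented for illustration, the HTML sample is made up, and it assumes the first fr-view div is the one that holds the chapter, which the question's edit suggests may not always be true.

```python
from html.parser import HTMLParser


class ChapterTextParser(HTMLParser):
    """Collect text after <div class="fr-view"> until an <a href="/novel...">."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href") or ""
        if tag == "div" and "fr-view" in classes:
            self.capturing = True      # start collecting chapter text
        elif self.capturing and tag == "a" and href.startswith("/novel"):
            self.capturing = False     # navigation link reached: stop
            self.done = True
        # Footnote links such as href="#footnote-1" do not stop the capture.

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.chunks.append(data.strip())


# Made-up page fragment that mimics the structure described above.
SAMPLE = """
<h4>Chapter 1</h4>
<div class="fr-view"><p>First paragraph.</p>
<p>Second paragraph.<a href="#footnote-1">1</a></p></div>
<a href="/novel/martial-world/mw-chapter-2">Next Chapter</a>
<p>Footer text</p>
"""

parser = ChapterTextParser()
parser.feed(SAMPLE)
chapter_text = "\n".join(parser.chunks)
print(chapter_text)
```

If an earlier fr-view div appears before the chapter, the start condition would need to be narrowed, for example by skipping to the second match.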
Reading mode in Spartan/Edge seems to choose, somehow, which div on the site to display in reading mode. In many pages, it does not find the appropriate div (like bbc.co.uk).
However, on our site, it enables reading mode, but then displays the completely wrong part of the page.
So, how can I tell it to use the right part of the page, or at least disable reading mode on those pages?
You can find information on how to optimize reading view, as well as how to opt-out, here: http://dev.modern.ie/testdrive/demos/readingview/
07/10: Edit to include specific information
Specifically, you may be interested in optimizing your title, body, and image markup to ensure a good reading mode experience.
Title
Your page should include a <title> element in the header. In addition, you should include a <meta title=""> tag that matches your main heading in your content section.
Body text
Ensure your main content does not include a lot of deeply nested elements and that font-sizes and other styles are uniform. Style variations for things like pull quotes, etc. should still be fine.
Images
The first eligible image becomes the dominant image of the article. The dominant image is rendered as the first piece of content and given full column width. All following images are rendered as inline images within the article.
Images are recommended to be wrapped in <figure> tags with no more than two <figcaption> tags.
Opting out
Including this meta tag will disable reading mode in IE11 and, currently, Microsoft Edge.
<meta name="IE_RM_OFF" content="true">
Add the following tag:
<meta name="IE_RM_OFF" content="true">
See the page below for more details:
http://dev.modern.ie/testdrive/demos/readingview/
I want to collect pictures from Google image search. However, I am constantly notified with an error.
For example, the URL https://www.google.com/search?q=banana&hl=en&gws_rd=ssl&tbm=isch is fine in my browser, but in web harvest it reports that: the reference to entity "gws_rd" must end with the ';' delimiter.
I guess '&' is a special character in webharvest, but I cannot find information about it. Can you figure out why?
This is the code:
<var-def name="search" overwrite="false">banana</var-def>
<var-def name="url"><template>http://images.google.com/images?q=${search}&hl=en</template></var-def>
<var-def name="xml">
<html-to-xml>
<http url="${url}"/>
</html-to-xml>
</var-def>
<var-def name="largeImgUrl">
<xpath expression="//*[@id='irc_cc']/div[4]/div[1]/div/div[2]/div[1]/a/img">
<var name="xml"/>
</xpath>
</var-def>
From experience, you will need to first store the URL in a variable, and then refer to that variable from within the http processor call.
EDIT
I notice you have pasted your code. Good.
1) Remember that all Web-Harvest config files are written in XML, and the ampersand (&) is a special character in XML because it starts an entity reference. That is why the parser complains that "gws_rd" must end with ';'.
In Web-Harvest I normally avoid this issue by using CDATA sections within <template> or <code> blocks.
2) When using the Web-Harvest graphical interface, you can easily debug your XPath expressions. Run your code as normal, then click the magnifying glass icon on the toolbar at the top and choose "xml" (the name of the variable you have set). This opens a new window with a preview of your XML. Make sure the "view as" dropdown is set to xml.
You should now have an "xpath expression" box where you can test your XPath.
3) I strongly discourage writing XPath expressions that refer to numbered elements (e.g. div[4]/div[1]/div/div[2]/div[1]/). Any small change in the underlying page usually breaks the code. It is much better to select elements based on id or other attributes.
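The ampersand problem from point 1 is a property of XML itself rather than of Web-Harvest, so it can be reproduced with any XML parser. The sketch below uses Python's standard-library ElementTree purely as a stand-in (it is not how Web-Harvest parses its config, and the helper name is made up) to show that a raw & is rejected while a CDATA section or &amp; yields the intended URL.

```python
import xml.etree.ElementTree as ET

URL = "http://images.google.com/images?q=banana&hl=en"

# Raw ampersand: "&hl" looks like the start of an entity reference.
raw = "<template>" + URL + "</template>"
# CDATA section: the content is taken literally, no escaping needed.
cdata = "<template><![CDATA[" + URL + "]]></template>"
# Escaped ampersand: the alternative to CDATA.
escaped = "<template>" + URL.replace("&", "&amp;") + "</template>"


def try_parse(doc):
    """Return the element text, or None if the document is not well-formed."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


print(try_parse(raw))
print(try_parse(cdata))
print(try_parse(escaped))
```

Only the first call fails; the other two return the URL unchanged, which is what the <template> element needs.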
I have gotten the hang of using the HTML Agility Pack to find specific nodes using their attributes and XPath expressions. The problem is, I've been doing this manually for each of my projects (opening the website HTML and scanning for the nodes that have the text I need). Is there a way to select a single node by its inner text? This would make it easier to write an update script for websites whose content scheme stays the same but whose attribute values change over time. Thanks in advance!
It would be better if you had provided sample HTML, but since you haven't, let's assume we have HTML containing this markup:
<body>
<div class="foo">bar</div>
</body>
You can select the <div> by its attribute using HtmlAgilityPack's SelectSingleNode() and XPath like so:
myHtmlDocument.DocumentNode.SelectSingleNode("//div[@class='foo']");
or you can select the same element by its inner text like so:
myHtmlDocument.DocumentNode.SelectSingleNode("//div[.='bar']");
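To check the two predicates quickly outside of .NET, the same selections can be tried with Python's standard-library ElementTree, which supports both the [@attr='value'] and [.='text'] forms (the latter since Python 3.7). This is a swapped-in tool for illustration only, not HtmlAgilityPack, and it assumes well-formed markup; the second <div> is added here just to show that the text predicate is selective.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample markup, plus one extra div for contrast.
markup = '<body><div class="foo">bar</div><div class="other">baz</div></body>'
root = ET.fromstring(markup)

# Select by attribute value.
by_attr = root.find(".//div[@class='foo']")
# Select by the element's complete text content.
by_text = root.find(".//div[.='bar']")

print(by_attr is by_text)
print(by_text.get("class"))
```

Both lookups return the same element, which mirrors the behaviour of the two SelectSingleNode() calls above.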
Hope this helps.