Extract multiple image links from a single div tag that has no class name

Hi, I have this HTML from a website and I want to extract all of the images in it. In some cases there are 3-4 images. How would I go about doing that?
<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>
My code:
chakras1 = soup.find_all("div")[42].img['src']
print(chakras1)
Result:
https://naruto-arena.net/images/energy/energy_2.gif
I only get the FIRST image but not the second one.

To extract multiple images from within a tag, I personally stick to using for loops. Once you have found the div tag you want to look within (say you store the div itself, not its first img, in chakras1), I would write the following:
for img in chakras1.find_all("img"):
    print(img["src"])
For me the steps kind of go as follows:
find the specific tags you are looking for (img src)
see what tag those tags are within (here, the div)
navigate through the HTML to that tag (using Beautiful Soup's .find or .find_all functions, depending on what you want to use)
once you have navigated to the tag, search within it for the tags you're really looking for.
One quick note: with this method, if you are looking through multiple div tags, you will need to loop through those as well.
I hope this makes sense!
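As a rough sketch of the steps above, the snippet below runs against the HTML from the question. It assumes Beautiful Soup 4 is installed and uses the built-in "html.parser". Locating the div through its "Chakra required:" label is an assumption made here to avoid the positional index [42]; the names label, chakra_div and image_urls are illustrative only.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question.
html = """
<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: the <u> label text identifies the div we care about,
# so we find the label first and then step up to its enclosing div.
label = soup.find("u", string="Chakra required:")
chakra_div = label.find_parent("div")

# find_all returns every <img> inside the div, not only the first one.
image_urls = [img["src"] for img in chakra_div.find_all("img")]
print(image_urls)
```

If the page has several such divs, the same list comprehension can be run inside a loop over each div, as the note above suggests.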

Related

How to filter text after webscraping

So I'm trying to webscrape this website that provides novels for free, for example this page: https://www.wuxiaworld.com/novel/martial-world/mw-chapter-1
I'm trying to extract only the title and the body of the chapter. Finding the title is easy enough since it's in an h4, but the body of the chapter is not wrapped in any specific div tag, so I cannot just isolate it. I was wondering how I'd do this. The closest I've gotten to just having the text is this.
P.S. I'm new to webscraping, sorry if my question is unclear or stupid.
I tried to identify whether the body text was under any exclusive div tag, but it wasn't, so I tried to select it through the closest div tag. This still returned a lot of useless and unwanted text.
Edit: @koro, there's more than one instance of fr-view being used, so it doesn't isolate the text. The fr-view class also appears before the chapter text.

I'm not versed in webscraping, but upon reviewing the page source HTML I see that <div class="fr-view"> only precedes the body text on the novel pages. If you start the logging after the scraper identifies this line, you should be able to stop at the very next <a href="/novel..... tag to only have the novel text included.
Some of the pages I see also include footnotes with some extra information. These include an <a href=#footnote....> tag, so if you would like to keep the footnotes included I would search for <a href=/novel...> and NOT <a href=...>
P.S. I only looked at 4 pages, and while they all appear to have the format I've pointed out above, it's still possible that you may run into issues. That's definitely a bridge you can cross when you get there!
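The start-marker/stop-marker idea described above could be sketched roughly as follows, using only the Python standard library. The page fragment, the chapter_text helper, and the START and STOP constants are all made up for illustration. The real page's markup may differ, and if fr-view occurs more than once (as the question's edit reports) the start marker would need to be more specific.

```python
import re
from html import unescape

# Made-up page fragment for illustration; not copied from the real site.
page = """
<h4>Chapter 1</h4>
<div class="fr-view">
<p>First paragraph of the chapter.</p>
<p>Second paragraph.</p>
<p><a href="#footnote-1">[1]</a> A translator's note.</p>
</div>
<a href="/novel/martial-world/mw-chapter-2">Next Chapter</a>
"""

START = '<div class="fr-view">'   # marker that precedes the body text
STOP = '<a href="/novel'          # first navigation link after the body


def chapter_text(html):
    """Return the plain text between the START marker and the next STOP marker."""
    start = html.find(START)
    if start == -1:
        return ""
    start += len(START)
    stop = html.find(STOP, start)
    if stop == -1:
        stop = len(html)
    body = html[start:stop]
    # Replace tags with spaces, decode entities, then collapse whitespace.
    text = unescape(re.sub(r"<[^>]+>", " ", body))
    return " ".join(text.split())


text = chapter_text(page)
print(text)
```

Because STOP only matches links that begin with /novel, the #footnote link stays inside the extracted text, which matches the suggestion about keeping footnotes.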

Kentico 9 search result transformation

We've noticed a bug when looking at French search results. In the CMS Desk, I've kept the Page Name in English for the French content. The issue is that these English names are showing on the French results page.
In the transformation, which is based on the default one, I present the clickable title like this:
<a href='<%# SearchResultUrl() %>' data-type="title" target="_blank" ><%#SearchHighlight(HTMLHelper.HTMLEncode(CMS.ExtendedControls.ControlsHelper.RemoveDynamicControls(DataHelper.GetNotEmpty(Eval("Title"), ""))), "<span class='highLight'>", "</span>")%></a>
Here's my thinking: if the Menu Caption is filled out, use that rather than the title. How do I output DocumentMenuCaption without adjusting the search fields on the menu page type?
I think my logic is: check if DocumentMenuCaption is empty; if it is, use Title.

You should be able to continue using GetNotEmpty and just pass in the DocumentMenuCaption first, something like this:
<%# GetNotEmpty(GetSearchValue("DocumentMenuCaption");Eval("Title")) %>
You may or may not need the "GetSearchValue" function, but that allows you to grab values from the object that may not be available in the default set of columns for the search results.
Alternatively, you should be able to use the IfEmpty() method:
<%# IfEmpty(GetSearchValue("DocumentMenuCaption"), Eval("Title"), GetSearchValue("DocumentMenuCaption")) %>
Both transformation methods are taken from here (double check the syntax of "GetNotEmpty", as there are different ways it's implemented): https://docs.kentico.com/k9/developing-websites/loading-and-displaying-data-on-websites/writing-transformations/reference-transformation-methods
You can read more about the search transformations here: https://docs.kentico.com/k9/configuring-kentico/setting-up-search-on-your-website/displaying-search-results-using-transformations

ExpressionEngine add class to wrap parameter

I have entries that contain an image as a channel entry field. I am using the following syntax in my template to generate an alt tag w/ data. The problem I have is that I haven't found a way to add a class attribute to the generated img tag.
{info-image wrap="image" class="img-responsive"}
outputs
<img src="http://secretproject.dev/images/uploads/general/dedication2.jpg" alt="dedication2">
where the alt data is the image title.
If I take another approach and write
<img class="img-responsive" src="{info-image}" alt="{title}">
then the title comes from the entry itself and not the file title.
In summary, I need both class and alt attributes but seem to be stuck with one or the other. Is there a way to achieve this in either syntax? Is there another approach? Thanks

You can use the variable pair syntax as described in the documentation.
{info-image}
<img src="{url}" alt="{title}" class="img-responsive">
{/info-image}

HTML Agility pack + Select node by its inner text

I have gotten the hang of using the HTML Agility Pack to find specific nodes using their attributes and XPaths. The problem is, I've been doing this manually for each of my projects (opening the website HTML and scanning for the nodes that have the text I need). Is there a way to select a single node by its inner text? This would make it easier to write an update script for websites whose content scheme stays the same but whose attribute values change over time. Thanks in advance!

It would be better if you had provided sample HTML, but since you haven't, let's assume we have HTML containing this markup:
<body>
<div class="foo">bar</div>
</body>
You can select the <div> by its attribute using HtmlAgilityPack's SelectSingleNode() and XPath like so:
myHtmlDocument.DocumentNode.SelectSingleNode("//div[@class='foo']");
or you can select the same node by its inner text like so:
myHtmlDocument.DocumentNode.SelectSingleNode("//div[.='bar']");
Hope this helps.
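To try the two XPath expressions outside of .NET, here is a small Python sketch using the standard library's xml.etree.ElementTree as a stand-in for HtmlAgilityPack (it is not the same API). ElementTree supports only a subset of XPath, including the [@attr='value'] predicate and, from Python 3.7, the [.='text'] predicate. The second div is added here purely to show that only the matching node is returned.

```python
import xml.etree.ElementTree as ET

# Sample markup from the answer, plus one extra div added for contrast.
root = ET.fromstring(
    '<body>'
    '<div class="foo">bar</div>'
    '<div class="baz">qux</div>'
    '</body>'
)

# Select by attribute value.
by_attr = root.find(".//div[@class='foo']")

# Select by inner text (requires Python 3.7 or newer).
by_text = root.find(".//div[.='bar']")

print(by_attr is by_text)     # both expressions reach the same element
print(by_text.get("class"))
```

Note that [.='bar'] compares the element's complete text, so a node with extra whitespace or additional text will not match.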

Selecting parent elements with WebdriverJS

I'm trying to create some scripts which require moving through a lot of on-the-fly HTML with random IDs. They require getting the parents of an element - but I'm not sure how to implement this in WebdriverJS.
If I access the element I want via a console, I can do the following to get it;
document.querySelector('span[email="noreply@example.com"]').parentNode.parentNode
Is there a way to do this in WDJS? I've looked and can't see anything obvious. It's specifically the parent traversal I'm having issues with. I saw that XPath may be a possible solution, but I'm unsure of the syntax, having never used it before.
Thanks in advance!

I don't know the syntax of WebDriverJS. But the XPath is as below, you need a way to fit it in somewhere.
This is based on your CSS Selector, so please show HTML if needed.
.//span[@email='noreply@example.com']/../..
For example, if you have HTML like this
<div>
<div>
<span email="noreply@example.com">Contact me</span>
</div>
</div>
You can avoid using .. to go up.
.//div[./div/span[@email='noreply@example.com']]
If you have more levels to look up, another handy method would be using ancestor from XPath Axes.
Also, as @sircapsalot brought up, the CSS selectors spec doesn't support selecting a parent, so XPath is the only way to go, unless you inject JS.
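The /../.. step can be checked quickly in Python with the standard library's xml.etree.ElementTree, which supports ".." for the parent element. This only illustrates the XPath itself and is not WebdriverJS code. The body wrapper and the id attributes are hypothetical additions so that the result is easy to identify.

```python
import xml.etree.ElementTree as ET

# Sample HTML from the answer; the <body> wrapper and id attributes are
# hypothetical additions used only to label the elements.
doc = ET.fromstring(
    '<body>'
    '<div id="outer">'
    '<div id="inner">'
    '<span email="noreply@example.com">Contact me</span>'
    '</div>'
    '</div>'
    '</body>'
)

# Find the span by its attribute, then step up two levels with "..".
grandparent = doc.find(".//span[@email='noreply@example.com']/../..")
print(grandparent.get("id"))

# One level up gives the direct parent instead.
parent = doc.find(".//span[@email='noreply@example.com']/..")
print(parent.get("id"))
```

ElementTree does not implement nested predicates or the ancestor axis, so the other two variants mentioned above would need a fuller XPath engine.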
