I need to extract certain information from HTML using VBA.
This is the HTML from which I am trying to extract the location information alone.
<dl id="headline" class="demographic-info adr">
<dt>Location</dt>
<dd>
<span class="locality">
Dallas/Fort Worth Area
</span>
</dd>
<dt>Industry</dt>
<dd class="industry">
Higher Education
</dd>
In my Excel VBA, after opening the web page, I am using the following code to extract the information.
Dim openedpage as String
openedpage = iedoc1.getElementById("headline").innerText
However, I am getting the information as,
Location Dallas/Fort Worth Area Industry Higher Education
I just need to extract,
Dallas/Fort Worth Area as the output.
Try: iedoc1.getElementById("headline").getElementsByTagName("span")(0).innerText
You're getting all the extra text because that is what you asked for: the innerText of the parent element, which is everything inside it.
The code above gets the "headline" element, then finds all "span" tags inside it. From the list returned, it picks the first item and returns its innerText.
Update
I always seem to get the index base wrong; the 1 in my original example should have been a 0.
Related
Hi, I have this HTML code from a website. I want to be able to extract multiple images; in some cases there are 3-4 images. How would I go about doing that?
<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>
My code:
chakras1 = soup.find_all("div")[42].img['src']
print(chakras1)
Result:
https://naruto-arena.net/images/energy/energy_2.gif
I only get the FIRST image but not the second one.
To extract multiple images from within a tag, I personally stick to using for loops. Once you have found the div tag that you want to look within (say you call that tag chakras1), I would write the following:
for img in chakras1.find_all("img"):
    print(img)
For me the steps kind of go as follows:
find the specific tags you are looking for (img src)
see what tag those tags are within (here, a div)
navigate through the HTML to that tag (using Beautiful Soup's .find or .find_all functions, depending on what you want to use)
once you have navigated to that tag, search within it for the tags you're really looking for.
One quick note: with this method, if you are looking through multiple div tags, you will need to loop through those as well.
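The steps above can be sketched without any third-party install. This is only a rough stand-in for the Beautiful Soup loop, using the standard library's html.parser instead; the names SAMPLE_HTML, ImgSrcCollector and extract_img_sources are made up for illustration, and the sample markup is copied from the question.

```python
from html.parser import HTMLParser

# Sample markup modelled on the snippet in the question.
SAMPLE_HTML = """
<div style="float: right;"><u>Chakra required:</u>
<img src="https://naruto-arena.net/images/energy/energy_2.gif">
<img src="https://naruto-arena.net/images/energy/energy_4.gif">
</div>
"""


class ImgSrcCollector(HTMLParser):
    """Collects the src of every <img> found inside any <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # how many <div> tags are currently open
        self.sources = []    # src values in document order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_depth += 1
        elif tag == "img" and self.div_depth > 0:
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def extract_img_sources(html):
    """Return the src of each <img> nested in a <div> (hypothetical helper)."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Loop over every image instead of reading only the first one.
for src in extract_img_sources(SAMPLE_HTML):
    print(src)
```

With Beautiful Soup the same idea is the two-line loop shown earlier; reading img["src"] inside that loop would give just the URL rather than the whole tag.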
I hope this makes sense!
So I'm trying to webscrape this website that provides novels for free, for example this page: https://www.wuxiaworld.com/novel/martial-world/mw-chapter-1
I'm trying to extract only the title and the body of the chapter. Finding the title is easy enough since it's in an h4; however, the body of the chapter is not separated by any specific div tags, so I cannot just isolate it. I was wondering how I'd do this. The closest I've gotten to just having the text is this.
P.S. I'm new to web scraping, sorry if my question is unclear or stupid.
I tried to identify whether the body of text was under any exclusive div tag, but it wasn't, so I tried to call it under whatever the closest div tag was. This still returned a lot of useless and unwanted text.
Edit: #koro, there's more than one instance of fr-view being used, so it doesn't isolate the text. The fr-view class also appears before the chapter text.
I'm not versed in web scraping, but upon reviewing the page source HTML I see that <div class="fr-view"> only precedes the body text on the novel pages. If you start the logging after the scraper identifies this line, you should be able to stop at the very next <a href="/novel..... tag to only have the novel text included.
Some of the pages I see also include footnotes with some extra information, these include an <a href=#footnote....> tag, so if you would like to keep the footnotes included I would search for <a href=/novel...> and NOT <a href=...>
P.S. I only looked at 4 pages, and while they all appear to have the format I've pointed out above, it's still possible that you may run into issues, but that's definitely a bridge you can cross when you get there!
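A minimal sketch of the start-marker/stop-marker idea, assuming the page source is already in a string. SAMPLE_PAGE is an invented fragment (the real page's markup may differ), and extract_chapter_text, START_MARKER and END_MARKER are hypothetical names. Note that str.find returns the first match, so if fr-view shows up more than once, as the question's edit says, the start marker would need to be made more specific.

```python
import re
from html import unescape

# Invented page fragment for illustration only.
SAMPLE_PAGE = """
<h4>Chapter 1</h4>
<div class="fr-view">
<p>First paragraph of the chapter.</p>
<p>Second paragraph with a note<a href="#footnote-1">[1]</a>.</p>
</div>
<a href="/novel/martial-world/mw-chapter-2">Next Chapter</a>
"""

START_MARKER = '<div class="fr-view">'
END_MARKER = '<a href="/novel'  # footnote links (href="#...") do not match this


def extract_chapter_text(page):
    """Return the text between the two markers, or None if the start is missing."""
    start = page.find(START_MARKER)
    if start == -1:
        return None
    start += len(START_MARKER)

    # Stop at the first navigation link after the start marker, if there is one.
    end = page.find(END_MARKER, start)
    if end == -1:
        end = len(page)

    fragment = page[start:end]
    text = re.sub(r"<[^>]+>", "", fragment)  # crude tag stripping, fine for a sketch
    lines = [line.strip() for line in unescape(text).splitlines()]
    return "\n".join(line for line in lines if line)


print(extract_chapter_text(SAMPLE_PAGE))
```

Because the stop marker only matches links that start with /novel, footnote anchors inside the chapter stay in the output.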
I'm using Python to scrape a retailer website for its HTML. I am looking for the data and attributes of their air conditioning products, such as Energy Efficiency, Constant or Variable Type, etc. Hence, I used requests.get(), and afterwards I plan to filter the data using regex or bs4.
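import requests

file_number = 0
for portal in portals:  # portals: the set of product page URLs compiled earlier
    item = requests.get(portal)
    item_text = item.text  # .text is already a str
    file_number += 1
    # zfill is a str method, so convert the counter first
    file_name = "blah" + str(file_number).zfill(4) + ".txt"
    # the with block closes the file even if the write fails
    with open(file_name, "w", encoding="utf8") as file:
        file.write(item_text)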
I could retrieve all the HTML pages from the set() I've compiled. However, the product price is missing. This piece of information is present if I go to the page and directly right-click --> inspect.
The example below is just one instance of the differences. The two files are the same, except that all references to the prices are omitted. (Just a wild guess: the price could appear slightly differently depending on who's shopping, which is why they're stored separately somehow.)
I'd also be glad to hear any suggestions on code improvement; I'm brand new to Python!
requests.get() version of info
<div class="p-price">
<strong class="J-p-32965125681"></strong> <span>X <span class="J-buy-num"></span></span>
</div>
vs
right-click --> inspect version of info
<div class="p-price">
<strong class="J-p-32965125681">¥3499.00</strong> <span>X <span class="J-buy-num"></span></span>
</div>
Thank you so much!
By the way, as a disclaimer, the robots.txt says:
User-agent: *
Disallow: /?*
And I'm not crawling any pages that have "?" in their URL, so...
Web scraping is tricky!
At first glance, the values seem to be added via JavaScript. In that case, you'll need to use a headless browser or an extension to scrape the DOM after the page finishes loading, rather than the skeleton HTML page that requests.get() returns.
We've noticed a bug when looking at French search results. In the CMS Desk, I've kept the Page Name in English for the French content. The issue is that these English names are showing on the French results page.
In the transformation, which is based on the default one, I present the clickable title like this:
<a href='<%# SearchResultUrl() %>' data-type="title" target="_blank" ><%#SearchHighlight(HTMLHelper.HTMLEncode(CMS.ExtendedControls.ControlsHelper.RemoveDynamicControls(DataHelper.GetNotEmpty(Eval("Title"), ""))), "<span class='highLight'>", "</span>")%></a>
Here's my thinking: if the Menu Caption is filled out, use that rather than the title. How do I output DocumentMenuCaption without adjusting the search fields on the menu page type?
I think my logic is: check if DocumentMenuCaption is empty; if it is, use Title.
You should be able to continue using GetNotEmpty and just pass in the DocumentMenuCaption first, something like this:
<%# GetNotEmpty(GetSearchValue("DocumentMenuCaption");Eval("Title")) %>
You may or may not need the "GetSearchValue" function, but that allows you to grab values from the object that may not be available in the default set of columns for the search results.
Alternatively, you should be able to use the IfEmpty() method:
<%# IfEmpty(GetSearchValue("DocumentMenuCaption"), Eval("Title"), GetSearchValue("DocumentMenuCaption")) %>
Both transformation methods are taken from here (double-check the syntax of "GetNotEmpty", as there are different ways it's implemented): https://docs.kentico.com/k9/developing-websites/loading-and-displaying-data-on-websites/writing-transformations/reference-transformation-methods
You can read more about the search transformations here: https://docs.kentico.com/k9/configuring-kentico/setting-up-search-on-your-website/displaying-search-results-using-transformations
I am trying to use Python to scrape TripAdvisor and pull the text from a specific span ---> <span>138<span> (without the Excellent)
<label for="taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_5">
<div class="row_label">Excellent</div>
<span class="row_bar">
<span class="row_fill" style="width:65%;"></span>
</span>
<span>138<span>
</span></span></label>
This is my code thus far:
for rating_all in moresoup.findAll('div',{'class':'col rating '}):
    for record in rating_all.findAll('li'):
        for rate1 in record.findAll('label',{'for':"taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_1"}):
            print(rate1.find('div',{'class':"row_label"}).text + ",\t")
            print(rate1.findAll('span'))
I tried using a subscript, but it wouldn't let me. When I use .text after the span lookup, it says there is no text; when I change findAll to find, it only finds the first span.
findAll (or more commonly find_all - which does the same thing) returns a list of all Tag objects matching your filters. Even if there is only one matching Tag you will still get a one-item list: [Tag].
Once you have the list of tags, you can get a single tag by indexing, e.g.:
soup.find_all('span')[0]
and you can get the text of one of your tags with the .text attribute:
soup.find_all('span')[0].text
In your particular case, I was able to get the text '138\n' with:
rate1.findAll('span')[2].text
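To see why index 2 is the right one, here is a rough standard-library sketch that lists the text of every span in the question's markup, in document order. It uses html.parser rather than Beautiful Soup so it runs without extra installs; SpanTextCollector and span_texts are hypothetical names, and each span's text is taken to include the text of anything nested inside it, which is an approximation of what .text gives you.

```python
from html.parser import HTMLParser

# The markup from the question, including its unclosed inner <span>.
SAMPLE_HTML = """<label for="taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_5">
<div class="row_label">Excellent</div>
<span class="row_bar">
<span class="row_fill" style="width:65%;"></span>
</span>
<span>138<span>
</span></span></label>"""


class SpanTextCollector(HTMLParser):
    """Builds a list with one text entry per <span>, in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []        # accumulated text for each span seen so far
        self.open_spans = []   # indexes into self.texts of spans still open

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.texts.append("")
            self.open_spans.append(len(self.texts) - 1)

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()

    def handle_data(self, data):
        # Text belongs to every span that is currently open (nested text included).
        for index in self.open_spans:
            self.texts[index] += data


def span_texts(html):
    """Return the list of span texts (hypothetical helper)."""
    parser = SpanTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


spans = span_texts(SAMPLE_HTML)
print(len(spans))          # row_bar, row_fill, the "138" span, and the inner span
print(repr(spans[2]))      # text of the third span, trailing newline included
print(spans[2].strip())    # same text with the whitespace removed
```

The first two entries are the row_bar and row_fill spans, which is why find alone (first match only) never reaches the number; calling .strip() on the result drops the trailing newline.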