I'm using Python to scrape a retailer website for its HTML. I'm looking for the data and attributes of their air conditioning products, such as Energy Efficiency, Constant or Variable Type, etc. Hence, I used requests.get(), and afterwards I plan to filter the data using regex or bs4.
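import requests

file_number = 0
for portal in portals:  # portals is the set() of product URLs compiled earlier
    item = requests.get(portal)
    item_text = item.text  # already a str, no conversion needed
    file_number += 1
    # str() is required here: an int has no zfill() method
    file_name = "blah" + str(file_number).zfill(4) + ".txt"
    # the with block closes the file even if write() fails
    with open(file_name, "w", encoding="utf8") as file:
        file.write(item_text)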
I could retrieve all HTML pages from the set() I've compiled. However, the product price is missing. This piece of information is present if I go to the page and directly right-click --> inspect.
The example below is just one instance of the differences. The two files are the same, except that all references to the prices are omitted. (Just a wild guess: the price could appear slightly differently depending on who's shopping, which is why they're stored separately somehow.)
I'd also be glad to hear any suggestions on code improvement, I'm brand new to Python!
requests.get() version of info
<div class="p-price">
<strong class="J-p-32965125681"></strong> <span>X <span class="J-buy-num"></span></span>
</div>
vs
right-click --> inspect version of info
<div class="p-price">
<strong class="J-p-32965125681">¥3499.00</strong> <span>X <span class="J-buy-num"></span></span>
</div>
Thank you so much!
By the way, as a disclaimer, the robots.txt says:
User-agent: *
Disallow: /?*
And I'm not crawling any page that has "?" in its URL, so...
Web Scraping is tricky!
At first glance, the values seem to be added via JavaScript. In that case, you'll need to use a headless browser or a browser extension to scrape the DOM after the page finishes loading, rather than the skeleton HTML page that the site serves initially.
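One quick way to check this before reaching for a headless browser is to look at the price placeholders in the saved requests.get() output. The sketch below uses only the standard library and the two snippets from the question; the class and function names (PricePlaceholderFinder, find_prices) are made up for illustration, and it assumes the price always sits in a <strong> whose class starts with "J-p-".

```python
from html.parser import HTMLParser


class PricePlaceholderFinder(HTMLParser):
    """Collect the text inside every <strong> whose class starts with 'J-p-'."""

    def __init__(self):
        super().__init__()
        self.prices = {}      # class name -> text found inside the tag
        self._current = None  # class of the <strong> currently open, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        if tag == "strong" and cls.startswith("J-p-"):
            self._current = cls
            self.prices[cls] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.prices[self._current] += data.strip()

    def handle_endtag(self, tag):
        if tag == "strong":
            self._current = None


def find_prices(html):
    """Return {class name: price text}; an empty string means 'not filled in'."""
    finder = PricePlaceholderFinder()
    finder.feed(html)
    return finder.prices


# The two versions quoted in the question.
static_html = (
    '<div class="p-price">'
    '<strong class="J-p-32965125681"></strong> '
    '<span>X <span class="J-buy-num"></span></span></div>'
)
rendered_html = (
    '<div class="p-price">'
    '<strong class="J-p-32965125681">¥3499.00</strong> '
    '<span>X <span class="J-buy-num"></span></span></div>'
)

print(find_prices(static_html))
print(find_prices(rendered_html))
```

If every placeholder comes back empty for the downloaded files but not for the HTML copied from the inspector, the prices are almost certainly filled in by a script after the page loads.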
Related
So I'm trying to webscrape this website that provides novels for free, for example this page: https://www.wuxiaworld.com/novel/martial-world/mw-chapter-1
I'm trying to extract only the title and the body of the chapter. Finding the title is easy enough since it's in an h4, but the body of the chapter is not separated by any specific div tags, so I cannot just isolate it. I was wondering how I'd do this. The closest I've gotten to just having the text is this.
P.S. I'm new to web scraping, sorry if my question is unclear or stupid.
I tried to identify whether the body of text was under any exclusive div tag, but it wasn't, so I tried to select it via the closest div tag. This still returned a lot of useless and unwanted text.
Edit: @koro, there's more than one instance of fr-view being used, so it doesn't isolate the text. The fr-view class also appears before the chapter text.
I'm not versed in webscraping but upon reviewing the page source html I see that <div class="fr-view"> only precedes the body text on the novel pages. If you start the logging after the scraper identifies this line you should be able to stop at the very next <a href="/novel..... tag to only have the novel text included.
Some of the pages I see also include footnotes with some extra information, these include an <a href=#footnote....> tag, so if you would like to keep the footnotes included I would search for <a href=/novel...> and NOT <a href=...>
P.S. I only looked at 4 pages, and while they all appear to have the format I've pointed out above, it's still possible that you may run into issues. That's definitely a bridge you can cross when you get there!
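The start-and-stop idea above can be sketched with plain string searching. This is only an illustration: extract_between is a hypothetical helper, the sample HTML is invented to mimic the structure described, and the sketch simply takes the first fr-view it finds, so it would need adjusting on pages where that class appears more than once.

```python
def extract_between(html,
                    start_marker='<div class="fr-view">',
                    end_marker='<a href="/novel'):
    """Return the HTML between the first start_marker and the next end_marker.

    Returns None if the start marker is missing. If the end marker is
    missing, everything after the start marker is returned.
    """
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        end = len(html)
    return html[start:end]


# Invented sample: a chapter body with a footnote link, then a /novel link.
sample = (
    '<h4>Chapter 1</h4><div class="fr-view"><p>First paragraph.</p>'
    '<p>Note<a href="#footnote-1">1</a></p>'
    '<a href="/novel/martial-world/mw-chapter-2">Next</a></div>'
)

body = extract_between(sample)
print(body)
```

Because the end marker is the literal prefix `<a href="/novel`, footnote links such as `<a href="#footnote-1">` do not end the extraction early.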
I have just started to use Python 3 & Beautiful soup and have obtained most of the visible information that I can see on the page using find() or findAll().
One of the pages that I have looked at dynamically loads an item code (productPLU) that is hidden on the page that you see normally, but is contained within the HTML code as per below:-
<div class="product ng-scope" ng-controller="ProductListItemController" ng-init="productPLU = '384353'">
Is it possible to scrape the productPLU information?
If so, what syntax would I use?
Thanks in advance.
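A minimal sketch of one way to pull the value out, assuming the attribute always has the form shown above (productPLU = '...' inside ng-init). It uses only the standard library; PLUCollector and find_product_plus are illustrative names, not part of any library. With Beautiful Soup the attribute text itself should be readable with tag.get("ng-init"), after which the same regular expression applies.

```python
import re
from html.parser import HTMLParser

# Matches: productPLU = '384353'  (whitespace around '=' is optional)
PLU_PATTERN = re.compile(r"productPLU\s*=\s*'([^']*)'")


class PLUCollector(HTMLParser):
    """Collect the productPLU value from every tag with an ng-init attribute."""

    def __init__(self):
        super().__init__()
        self.plus = []

    def handle_starttag(self, tag, attrs):
        ng_init = dict(attrs).get("ng-init")
        if ng_init:
            match = PLU_PATTERN.search(ng_init)
            if match:
                self.plus.append(match.group(1))


def find_product_plus(html):
    """Return the list of productPLU codes found in the HTML, in page order."""
    collector = PLUCollector()
    collector.feed(html)
    return collector.plus


sample = (
    '<div class="product ng-scope" '
    'ng-controller="ProductListItemController" '
    "ng-init=\"productPLU = '384353'\"></div>"
)

print(find_product_plus(sample))
```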
I'm using rvest to scrape from https://www.psychologytoday.com/ca/therapists/m5g ; in particular what I'm after is the data-myurl html attribute in the div tag with id="results-page" . If you view source you'll see there's only one div with id="results-page" . The data-myurl attribute looks like the main URL except with the addition of a string of numbers separated by a period and underscore, like so
<div id="results-page" data-myurl="https://www.psychologytoday.com/ca/therapists/m5g?sid=1510588046.3852_2969">
The numbers you see will likely be different. To try and extract it, I use the following code:
require(rvest)
fsa <- read_html('https://www.psychologytoday.com/ca/therapists/m5g')
fsa %>% html_node('div #results-page') %>% html_attr("data-myurl")
However, this returns only
[1] "https://www.psychologytoday.com/ca/therapists/m5g"
So everything after the original URL is missing. It doesn't seem like a JS thing since I don't see any script tags when I view source. Does anyone know what these numbers in the URL actually are and how to extract them? Thanks!
You can't do this with rvest.
The page you're trying to scrape is dynamically rendered after loading the initial page. The content itself is always the same, but the sid numbers change the ordering of the results after loading the page. The sid changes on every visit and page reload.
I suspect this was done to avoid a market bias when searching for a therapist.
If you really want the sid number, you need to use a tool that handles dynamic pages like casperjs.
(http://casperjs.org/)
Edit:
Alternatively, if it has to be done in R then you can use RSelenium. (https://cran.r-project.org/web/packages/RSelenium/)
The relevant starting point would be here:
https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
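Once the rendered page is available from whichever tool is used, the sid itself is just a query parameter of the data-myurl value. A small sketch in Python, using the example URL from the question; extract_sid is a hypothetical helper name:

```python
from urllib.parse import urlparse, parse_qs


def extract_sid(data_myurl):
    """Return the sid query parameter of the URL, or None if it has none."""
    query = urlparse(data_myurl).query
    values = parse_qs(query).get("sid")
    return values[0] if values else None


# Value seen in the rendered page versus the value returned by a static fetch.
rendered = "https://www.psychologytoday.com/ca/therapists/m5g?sid=1510588046.3852_2969"
static = "https://www.psychologytoday.com/ca/therapists/m5g"

print(extract_sid(rendered))
print(extract_sid(static))
```

A None result is a quick signal that the HTML being parsed is the static version, before the sid has been added.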
I need to extract certain information from HTML using VBA.
This is the HTML from which I am trying to extract the location information alone.
<dl id="headline" class="demographic-info adr">
<dt>Location</dt>
<dd>
<span class="locality">
Dallas/Fort Worth Area
</span>
</dd>
<dt>Industry</dt>
<dd class="industry">
Higher Education
</dd>
In my Excel VBA, after opening the web page, I am using the following code to extract the information.
Dim openedpage as String
openedpage = iedoc1.getElementById("headline").innerText
However, I am getting the information as,
Location Dallas/Fort Worth Area Industry Higher Education
I just need to extract,
Dallas/Fort Worth Area as the output.
Try: iedoc1.getElementById("headline").getElementsByTagName("span")(0).innerText
You're getting all the extra text because that is kind of what you asked for: the innerText of the parent element, which is everything inside of it.
The above code gets the content of the "headline" element, then finds all "span" tags inside of it. Looking at the list returned, it chooses the first instance and returns the innerText.
Update
I always seem to get the index base wrong, the 1 in my example should have been a 0
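The same parent-versus-child distinction can be sketched outside VBA. The Python below is only an illustration of the logic (find the element by id, then take the text of the first span inside it), using the HTML from the question; FirstSpanText and first_span_text are made-up names, and the sketch assumes spans are not nested and does not track where the parent element ends.

```python
from html.parser import HTMLParser


class FirstSpanText(HTMLParser):
    """Capture the text of the first <span> after the element with parent_id."""

    def __init__(self, parent_id):
        super().__init__()
        self.parent_id = parent_id
        self.in_parent = False  # True once the parent's start tag is seen
        self.in_span = False    # True while inside the first span
        self.done = False       # True after the first span has closed
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.parent_id:
            self.in_parent = True
        elif self.in_parent and tag == "span" and not self.done:
            self.in_span = True

    def handle_data(self, data):
        if self.in_span:
            self.text += data

    def handle_endtag(self, tag):
        if self.in_span and tag == "span":
            self.in_span = False
            self.done = True


def first_span_text(html, parent_id):
    """Return the stripped text of the first span inside the given element."""
    parser = FirstSpanText(parent_id)
    parser.feed(html)
    return parser.text.strip()


sample = """<dl id="headline" class="demographic-info adr">
<dt>Location</dt>
<dd>
<span class="locality">
Dallas/Fort Worth Area
</span>
</dd>
<dt>Industry</dt>
<dd class="industry">
Higher Education
</dd>"""

print(first_span_text(sample, "headline"))
```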
I am running an experiment in which we are trying to train people to be synaesthetes (they have additional experience of colour associated with numbers or letters).
I wondered if anyone has some advice about the easiest way to modify a web browser, such as Firefox, so that just the 10 letters A-J would always be displayed in a specific colour on any page they visited on the web?
Much appreciated
There are many ways to do this (cross-browser):
For example, you could define an element in a stylesheet to have a different color.
When the document loads, you check the whole document via JavaScript/jQuery (but only the text contents of tags) for your specified letters and wrap each of them in that element.
Not the best solution, but a way.
Take a look at Greasemonkey, a Firefox plug-in designed to do this kind of thing. There are lots of pre-made scripts available at http://userscripts.org/, and several of them look like they'd help you figure out how to write your own to re-color single letters.
Here is just an abstract rough draft of a blueprinted preliminary form of a beta version of a potential solution: using the javascript: prefix of links in a bookmark, as follows.
Create a new entry in your bookmarks toolbar
In the URL input, copy/paste the following line: javascript:var html = document.body.innerHTML; html = html.replace(/([a-j])/ig, '<span style="color: red;">$1</span>'); while(html.match(/(<[^>]*)<[^>]+>([^<]+)<\/[^>]+>([^>]*>)/g) != null) {html = html.replace(/(<[^>]*)<[^>]+>([^<]+)<\/[^>]+>([^>]*>)/g, '$1$2$3');} document.body.innerHTML = html;
Give your bookmark a name (e.g. "A-J to red") and save
You can now visit any website and click on that bookmark, which will put all letters between a and j in red
In a more digestible form:
// get the content of the body
var html = document.body.innerHTML;
// surround any letter between a and j by a <span></span>
html = html.replace(/([a-j])/ig, '<span style="color: red;">$1</span>');
// but it also replaces a-j letters within html tags
while(html.match(/(<[^>]*)<[^>]+>([^<]+)<\/[^>]+>([^>]*>)/g) != null) {
    // so if there are html tags within other html tags, delete the created <span></span>
    html = html.replace(/(<[^>]*)<[^>]+>([^<]+)<\/[^>]+>([^>]*>)/g, '$1$2$3');
}
// and replace the innerHTML of the body
document.body.innerHTML = html;
That's really not a final solution, but yeah, maybe you could work on it to improve the results.
PS: don't try with IE...
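A possible variation on the same idea avoids the clean-up loop by matching whole tags first and only wrapping letters that fall outside them. The sketch below shows that single-pass approach in Python rather than as a bookmarklet; colour_letters and the sample string are illustrative, and the sketch does not treat script or style contents, comments, or entities such as &amp; specially.

```python
import re

# Either a complete tag (group 1) or a single letter a-j in any case (group 2).
TOKEN = re.compile(r"(<[^>]*>)|([a-jA-J])")


def colour_letters(html, colour="red"):
    """Wrap every a-j letter outside of tags in a coloured <span>."""

    def replace(match):
        if match.group(1) is not None:
            return match.group(1)  # a tag: leave it untouched
        return '<span style="color: {};">{}</span>'.format(colour, match.group(2))

    return TOKEN.sub(replace, html)


sample = '<p class="big">Zak</p>'
print(colour_letters(sample))
```

Because the tag alternative is tried first at each position, letters inside attribute values such as class="big" are consumed as part of the tag and never wrapped.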