Unable to parse webpage with frameset - groovy

I'm trying to parse Groovydoc, but Jsoup doesn't find the frameset in which everything is contained.
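import org.jsoup.Connection
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.select.Elements

Connection connection = Jsoup.connect('http://groovy-lang.org/api.html')
Document document = connection.get()
Elements element = document.getElementsByTag('frameset')
element.each { println(it) }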

If you check the result that is returned by connection.get() you can see that there is no frameset tag:
println document
Now, if you open the site in a browser and use the developer tools to look at its HTML, you can see that the frameset you are looking for is a child of an iframe whose source is http://docs.groovy-lang.org/latest/html/gapi.
Just load the iframe URL with Jsoup to get the frameset:
Connection connection = Jsoup.connect('http://docs.groovy-lang.org/latest/html/gapi')
Document document = connection.get()
Elements element = document.getElementsByTag('frameset')
element.each { println it }
Or, if you do not want to hardcode the iframe source URL, look at this SO answer on how to get the source URL.
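The idea of reading the iframe source from the outer page instead of hardcoding it can be sketched as follows. This is a Python standard-library stand-in, not the Jsoup code from the answer. The names `IframeSrcCollector` and `iframe_sources` and the `SAMPLE_HTML` markup are made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative sources against the outer page URL.
                self.sources.append(urljoin(self.base_url, src))


def iframe_sources(html_text, base_url):
    """Return the absolute URLs of all iframes found in html_text."""
    collector = IframeSrcCollector(base_url)
    collector.feed(html_text)
    return collector.sources


# Hypothetical outer page: only the iframe URL comes from the answer above.
SAMPLE_HTML = (
    '<html><body>'
    '<iframe src="http://docs.groovy-lang.org/latest/html/gapi"></iframe>'
    '</body></html>'
)

if __name__ == "__main__":
    for url in iframe_sources(SAMPLE_HTML, "http://groovy-lang.org/api.html"):
        print(url)
```

Each URL returned this way can then be fetched and parsed in a second request, which is the step where the frameset becomes visible.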

Related

getting # after extracting href from <a> tag

I'm trying to scrape https://www.pagesjaunes.fr/annuaire/marseille-13/jardinier and I have a problem with pagination.
The link to the next page is stored in an <a> tag, but a['href'] gives me # instead of the link.
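from bs4 import BeautifulSoup
from lxml import html

# response is the HTTP response for the page above (for example from requests.get)
tree = html.fromstring(response.text)
soup = BeautifulSoup(response.text, 'html.parser')
footer = soup.find(class_='result-footer')
div_pagination = footer.find(class_='pagination')
# the "next page" link
atag = div_pagination.find("a", {"id": "pagination-next"})
print(atag.get('href'))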
Output : #
Note: I make the request without the Accept-Encoding header, so the server does not compress the response.
html tag :
Suivant
tag with beautifulsoup:
Suivant
As you can see if you inspect the page's source code in your browser (or just print it), this link uses JavaScript for navigation.
The tag carries additional (non-standard) attributes, so you can try to reverse engineer the whole thing: check the tag's attribute values, click the link in your browser, and compare with the new page's effective URL.
If that doesn't work, you'll need a headless browser and code to drive it (Selenium being the canonical Python solution).
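A first step in that reverse engineering is to dump every attribute of the link, not just href. The sketch below does this with the Python standard library. `attributes_of` is a name made up for this example, and the `data-target` attribute in `SAMPLE_HTML` is invented, since the real tag's attributes are not shown here.

```python
from html.parser import HTMLParser


class TagAttributeDumper(HTMLParser):
    """Records all attributes of the first element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.attributes is None and attr_map.get("id") == self.element_id:
            self.attributes = attr_map


def attributes_of(html_text, element_id):
    """Return a dict of attributes for the element, or None if it is absent."""
    dumper = TagAttributeDumper(element_id)
    dumper.feed(html_text)
    return dumper.attributes


# Hypothetical markup: "data-target" is an invented attribute, for illustration only.
SAMPLE_HTML = '<a id="pagination-next" href="#" data-target="page-2">Suivant</a>'

if __name__ == "__main__":
    attrs = attributes_of(SAMPLE_HTML, "pagination-next")
    # Anything besides id, class and href is a candidate for holding the real target.
    extras = {k: v for k, v in attrs.items() if k not in ("id", "class", "href")}
    print(extras)
```

With BeautifulSoup the same information is available as `atag.attrs`, so the dump can be done directly on the tag found in the question's code.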

scrapy pagination without href

I created a spider that takes the information from the table below, but I cannot switch to the previous table because the button does not have an "href". How do I do it?
https://br.soccerway.com/teams/italy/as-roma/1241/
previous button without href
<a rel="previous" class="previous " id="page_team_1_block_team_matches_summary_7_previous">« anterior</a>
If you look at the network inspector in your browser, you can see an XHR request being made when you click the next button.
That request returns a JSON response containing the HTML changes.
You need to reverse engineer how the page generates the URL of that request:
https://br.soccerway.com/a/block_team_matches_summary?block_id=page_team_1_block_team_matches_summary_7&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_371546%22%2C%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C%22block_service_id%22%3A%22team_summary_block_teammatchessummary%22%2C%22team_id%22%3A1241%2C%22competition_id%22%3A0%2C%22filter%22%3A%22all%22%2C%22new_design%22%3Afalse%7D&action=changePage&params=%7B%22page%22%3A1%7D
And then you can use that to retrieve following pages.
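One way to take that URL apart and rebuild it for another page is sketched below with the Python standard library. It assumes the `page` field inside the `params` query parameter selects the page, and that the server accepts other values for it. Neither is confirmed here, and `url_for_page` is a name made up for this example.

```python
import json
from urllib.parse import parse_qs, quote, urlencode, urlsplit, urlunsplit

# The request URL captured in the network inspector (copied from above).
CAPTURED_URL = (
    "https://br.soccerway.com/a/block_team_matches_summary"
    "?block_id=page_team_1_block_team_matches_summary_7"
    "&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B"
    "%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_371546%22%2C"
    "%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C"
    "%22block_service_id%22%3A%22team_summary_block_teammatchessummary%22%2C"
    "%22team_id%22%3A1241%2C%22competition_id%22%3A0%2C%22filter%22%3A%22all%22%2C"
    "%22new_design%22%3Afalse%7D"
    "&action=changePage&params=%7B%22page%22%3A1%7D"
)


def url_for_page(captured_url, page):
    """Rebuild the captured XHR URL with a different page number.

    Assumption: the "page" field of the "params" JSON selects the page.
    """
    parts = urlsplit(captured_url)
    # parse_qs percent-decodes the values; each key appears once here.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}

    params = json.loads(query["params"])
    params["page"] = page
    query["params"] = json.dumps(params, separators=(",", ":"))

    # quote_via=quote keeps spaces as %20, matching the captured URL.
    return urlunsplit(parts._replace(query=urlencode(query, quote_via=quote)))


if __name__ == "__main__":
    print(url_for_page(CAPTURED_URL, 2))
```

In a Scrapy spider the rebuilt URL could be passed to `scrapy.Request`, and the callback would then parse the JSON body and extract the HTML fragment it carries.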

node-phantom handle hyperlink of iframe fail

I recently integrated Node and PhantomJS using phantomjs-node. I opened a page that contains an iframe element. I can get the hyperlink element inside the iframe, but clicking it fails.
Is there a way to make this work?
example:
page.open(url);
...
page.evaluate(function(res){
    var childDoc = $(window.frames["iframe"].document),
        submit = childDoc.find("[id='btnSave']"),
        cf = submit.text(); // succeeds, returns the text
    submit.click(); // fails
    return cf;
}, function(res){
    console.log("result=" + res); // result=submit
    spage.render("test.png"); // the form is not submitted
    ph.exit();
});
You can't execute stuff in an iframe. You can only read from it. You even created a new document from the iframe, which will only contain the textual representation of the iframe, but it is in no way linked to the original iframe.
You would need to use page.switchToFrame to switch to the frame to execute stuff on the frame without copying it first.
It looks like switchToFrame is not implemented in phantomjs-node. You could try node-phantom.
If the iframe is on the same domain, you can try the following:
submit = $("iframe").contents().find("[id='btnSave']")
cf = submit.text();
submit.click()
If the iframe is not from the same domain, you will need to create the page with web security turned off:
phantom.create('--web-security=false', function(page){...});

Construct the url to fetch a page by url

With this code I create a page in a Google site:
pageEntry = new WebPageEntry();
pageEntry.setTitle(new PlainTextConstruct(PageTitle));
..
..
client.insert(new URL(getContentFeedUrl()), pageEntry);
If the PageTitle contains something like "création", the page will be created with the name https://sites.google.com/.../.../cration, so "création" is changed to "cration".
Is the process to change the page name available in the API? I would like to fetch the page by its path, but the only key I have is "création".
Maybe a better solution would be to strip the diacritics from the characters in the string before setting it as the page title? For instance, see the JavaScript function here. Then your page would be created with the URL /creation, which could be more desirable.
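The diacritic-stripping step can be sketched as follows. This is a Python illustration of the technique, not the JavaScript function referred to above and not part of the Google Sites API. `strip_diacritics` is a name made up for this example.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that e.g. "création" becomes "creation"."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("création"))  # prints "creation"
```

Applying the same function to both the title at creation time and the key used for lookup keeps the two consistent, whatever the site does to unsupported characters.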

chrome extension inject in frameset code

I am developing an extension for Google Chrome.
I inject my JS (plus jQuery) at the document_end event.
In the manifest I set "all_frames": true.
The URL I match contains a frameset.
My goal is to get an element by id inside one of the frames.
This is what I do:
// wait for my frame to load
$("frame[name='header']").load(function() {
    // here I need to get an element by id inside this frame
});
How do I do this properly?
PS: my frame is from the same domain.
You don't need to wait for the document to load. I assume you're doing this from a content script. Just set "run_at" in the manifest to document_end, and within your content script check whether the current URL is that frame's page; if it is, you are running inside that frame's document.
Something like this:
if (location.href == 'http://someweb.com/path/to/page.html') {
    var foo = document.getElementById('foo');
}
Something like that will get you started.
