Incomplete HTML attribute when using rvest

I'm using rvest to scrape https://www.psychologytoday.com/ca/therapists/m5g; in particular, I'm after the data-myurl HTML attribute of the div tag with id="results-page". If you view the source, you'll see there is only one div with id="results-page". The data-myurl attribute looks like the main URL with the addition of a string of numbers separated by a period and an underscore, like so:
<div id="results-page" data-myurl="https://www.psychologytoday.com/ca/therapists/m5g?sid=1510588046.3852_2969">
The numbers you see will likely be different. To try and extract it, I use the following code:
require(rvest)
fsa <- read_html('https://www.psychologytoday.com/ca/therapists/m5g')
fsa %>% html_node('div #results-page') %>% html_attr("data-myurl")
However, this returns only
[1] "https://www.psychologytoday.com/ca/therapists/m5g"
So everything after the original URL is missing. It doesn't seem like a JS thing since I don't see any script tags when I view source. Does anyone know what these numbers in the URL actually are and how to extract them? Thanks!

You can't do this with rvest.
The page you're trying to scrape is dynamically rendered after loading the initial page. The content itself is always the same, but the sid numbers change the ordering of the results after loading the page. The sid changes on every visit and page reload.
I suspect this was done to avoid a market bias when searching for a therapist.
If you really want the sid number, you need a tool that handles dynamic pages, such as CasperJS (http://casperjs.org/).
Edit:
Alternatively, if it has to be done in R then you can use RSelenium. (https://cran.r-project.org/web/packages/RSelenium/)
The relevant starting point would be here:
https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
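Once a browser-driven tool has returned the rendered data-myurl value, pulling the sid out of it is plain string work. A minimal sketch in Python (not R), assuming the attribute value has the form shown in the question; extract_sid is a hypothetical helper name, not part of any library:

```python
from urllib.parse import urlsplit, parse_qs


def extract_sid(data_myurl):
    """Return the sid query parameter from a data-myurl value, or None if absent."""
    query = urlsplit(data_myurl).query
    values = parse_qs(query).get("sid")
    return values[0] if values else None


# The example value from the question; a real sid differs on every page load.
example = "https://www.psychologytoday.com/ca/therapists/m5g?sid=1510588046.3852_2969"
print(extract_sid(example))
```

The helper returns None for the bare URL that rvest sees, which makes it easy to detect that the page was read before the sid was added.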

What can I use to place objects anywhere on a website, such as positioning a table in any part of the page?

I'm new to HTML and I would like to know how to place any object in any part of the website, for example, a table between the center and the left side. Thank you
HTML is about the structure of your website and its pages. Once you have the structure, with all the elements written in HTML, you can customise their display and position through CSS (a separate language). Check this page to get a first idea of what it looks like: https://www.w3schools.com/css/.

Kentico 9 transformation page/file URL

In my custom page type, you can select an uploaded file. That's fine, but in my ascx transformation, I'm having a hard time getting the URL. The field is 'Process'.
Here's what i currently have.
<%# IfEmpty(Eval("Process"),"N/A","<a href=" + Eval("Process") +" target='blank' class='icon download'>Download</a>")%>
When rendered, the html is this:
Download
I'm missing something.
You can use either of the two methods below. Both have their drawbacks, though.
<a href="<%# GetFileUrl("Process", "Something") %>">Link here</a>
The drawback with this is that if there is no value in the "Process" field, it will return an invalid URL. So I tend to use something a little better (but not much):
Item to download
This will create a valid URL with some invalid properties to it. Meaning if there is no value in the Process field, it will return 00000000-0000-0000-0000-000000000000. If the NodeAlias field is empty, it will return "download". So again, not 100% fool-proof but it works well in most cases.
Update
Check out this link:
https://devnet.kentico.com/articles/options-for-file-fields-in-structured-data
I think the piece you need in here is in the "CMS.File page type" section:
This is the link to the picture
Check out transformation methods reference
You can use <%#GetImage(Eval("Process"))%>. This will return an Image tag. There are a couple other parameters for sizing if you want to use those.
See the "Transformation reference" link in your Transformation editor; it lists all the available transformation methods you can use.
It shows:
This will generate an actual image tag. If, however, you want a link, it is usually of the form:
/getattachment/<%# Eval("TheImage")%>/ImageFileNameCanBeAnythingThough.jpg
example:
/getattachment/1936c69d-a28c-428a-90a7-09154918da0f/Christmas.jpg

WKHTMLTOPDF Dynamic Header on every page

I am trying to produce a PDF file from a large HTML file using the WKHTMLTOPDF library in Node. I need to put some content in the header and footer of every page, but the header content changes on every page: for example, a custom number in a format like BX008761 that increments on every page.
The first page will be BX008761, the second page BX008762, the third BX008763, and so on.
I found a related thread:
WKHTMLTOPDF -- Is possible to display dynamic headers?
the above thread states:
"you can feed --header-html almost anything :) Try the following to see my point:
wkhtmltopdf.exe --margin-top 30mm --header-html isitchristmas.com google.fi x.pdf
So isitchristmas.com could be www.yoursite.com/magical/ponies.php"
Is the source given to the --header-html option requested for every page of the rendered PDF, or just once per PDF?
I appreciate your support. Thank you.
EDIT: I have tried a sample program and confirmed that the value given to the --header-html option is processed for every page rendered within the PDF. I am using a remote service that returns the HTML string as the response to the URL.
Now it displays the HTML string as is, instead of rendering it.
When the service returns the string below:
<html> <body> <span style="color:red" > 123 :: 0 :: 3000025 :: 634943551338828720</span> <body> <html>
then the header on every page shows that same string literally instead of displaying the text in red. How do I make wkhtmltopdf understand that the content it received from the service needs to be rendered as HTML?
I'd appreciate it if anyone can suggest a workaround.
Thank you.
EDIT: I have used another workaround to return an HTML page for the header content. I essentially used an HTTPHandler in ASP.NET to return a valid response, and that appears to have solved the core problem of having a dynamic header on every page.
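The per-page numbering can be computed inside whatever serves the header. The wkhtmltopdf manual describes header and footer documents as receiving GET arguments such as page and topage, so a handler can derive the label from page. A minimal Python sketch of that logic (the workaround above used an ASP.NET HTTP handler instead); page_label and header_html are hypothetical names, and the starting number 8761 is taken from the example in the question:

```python
from html import escape
from urllib.parse import parse_qs


def page_label(page, first_number=8761, prefix="BX", width=6):
    """Label for a 1-based page number, e.g. page 1 -> BX008761."""
    return f"{prefix}{first_number + page - 1:0{width}d}"


def header_html(query_string):
    """Build a complete header document from the query string of one header request."""
    # Fall back to page 1 if the argument is missing (an assumption for this sketch).
    page = int(parse_qs(query_string).get("page", ["1"])[0])
    label = escape(page_label(page))
    return (
        "<!DOCTYPE html><html><body>"
        f'<span style="color:red">{label}</span>'
        "</body></html>"
    )


print(header_html("page=2&topage=10"))
```

Serving the result with a text/html content type is presumably what lets wkhtmltopdf render the markup rather than print it literally; that is an assumption consistent with the workaround described above, not something confirmed in the thread.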

How Can I Use Shadowbox to Extract Text Only from Webpage?

I have an article set up in Joomla that displays Terms and Conditions for the site users. I would like this to show up in a shadowbox when a user clicks a link. Here is the current anchor text example:
Terms and Conditions
This works well for displaying the entire web page, but what I would like to do is display just the article text on the page (plain, with a white background). Is this in some way possible with shadowbox? If so, how?
If I'm understanding you correctly - you want to suppress the modules and other periphery from your 'page' when it is loaded in the shadowbox.
Add ?tmpl=component to the url of your link.
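If the link already carries a query string, the parameter has to be appended with & rather than ?. A small Python sketch of adding it safely; with_component_template is a hypothetical helper, and the article URL below is a made-up example rather than the asker's real link:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_component_template(url):
    """Return url with tmpl=component set, replacing any existing tmpl value."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "tmpl"
    ]
    params.append(("tmpl", "component"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical Joomla article link, used only for illustration.
print(with_component_template("index.php?option=com_content&view=article&id=5"))
```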
You can do this with a div element and css shadow effect.
How to show/hide div is explained here:
http://www.randomsnippets.com/2008/02/12/how-to-hide-and-show-your-div/
How you can add shadow is explained here:
http://placenamehere.com/article/384/css3boxshadowininternetexplorerblurshadow/
I believe there are some components to do this - but you may have to get creative to do it without pulling the whole page with an a href tag.
In the database there's a particular area that holds that specifically and you could write a little query to just pull that information specifically and put it in the shadowbox, but what that query would look like I'm not sure.

Can I link to an HTML file in my project from a UIWebView?

If I load a string containing HTML into a UIWebView, and that string contains hyperlinks that are relative to that string, i.e. <a href="#something">, where there is some object with id "something", then the link works: click on it and the web view jumps to the referenced object.
What I want is to get navigation to a different file in my project, in other words as though the path to the different file were a URL.
I have found that if the href IS a URL, such as href="http://www.amazon.com", then the link works.
If I put the name of a file, or the result of [[NSBundle mainBundle] pathForResource:ofType:] for that name, in the href, then the link does not work.
Is there some way I can generate the equivalent of a URL pointing to an HTML file that is in the project, so that an <a href> can link to that HTML file?
I found a solution at this link:
How to use Javascript to communicate with Objective-c code?
Essentially, the solution is to implement the UIWebViewDelegate protocol's shouldStartLoadWithRequest method, and "trap" a particular value of scheme. So my links, instead of saying something like:
<a href="http://someplace.location">
are like:
<a href="mylink://#filename.ext">
By catching attempts to load anything with scheme "mylink," I can use:
[[request URL] fragment]
within shouldStartLoadWithRequest, and get the filename.ext. I then release my previous UIWebView, load in the contents of the specified file, and make that the contents of a new UIWebView. The effect is that the links work with normal appearance, even though they are being implemented with my code. I return NO because I don't want the usual loading to take place. If the scheme is NOT mylink, I can return YES to allow normal operation.
Regrettably, I still have no way to jump TO a fragment within a web view. In linking to a real URL, you can say something like "www.foo.org#page50" and jump straight to wherever an object on the new page has an id of "page50." With my method, I can only go to the top of the page.
This is also not going to give me a "go-back" function unless I record the filenames and implement it myself.
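The routing decision made inside shouldStartLoadWithRequest can be illustrated outside of iOS. The following is a Python sketch of the same logic, not Objective-C and not the original implementation; route_link is a hypothetical name, and the returned boolean stands in for the YES/NO the delegate method returns:

```python
from urllib.parse import urlsplit


def route_link(url, scheme="mylink"):
    """Return (allow_default_load, filename) for a tapped link.

    Links with the custom scheme are trapped: default loading is refused
    and the fragment is treated as the bundled file to load instead.
    """
    parts = urlsplit(url)
    if parts.scheme == scheme:
        return (False, parts.fragment)
    return (True, None)


print(route_link("mylink://#filename.ext"))
print(route_link("http://www.amazon.com"))
```

Because the fragment slot is already used to carry the file name, there is no room left in this scheme for an in-page anchor, which matches the limitation noted above.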
