Web scraping using rvest

I want to scrape all the text in the following website:
http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172
My code:
library(rvest)
html <- read_html("http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172")
main_content <- html_nodes(html, css = "#document_content")
main_text <- main_content %>% html_nodes("p") %>% html_text()
However, this does not extract all of the text, because some of it sits inside <dd>...</dd> nodes.
I wonder if I can do something like html_nodes("p") or html_nodes("dd") or html_nodes("dt") to replace html_nodes("p") in the above code.
How can I achieve this? Or is there any other way I can accomplish my task? Ideally, I don't want to use
main_text <- main_content %>% html_text()
because I want to separate each sentence.

In a CSS selector, separating the nodes you want with commas acts like a logical OR:
library("rvest")
url = "http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172"
page <- read_html(url)
main_text <- page %>%
html_nodes("#document_content") %>%
html_nodes("p,dd,dt") %>%
html_text()

Related

How do I add colour based on values in gt table?

I've generated the following gt table and I was wondering how to add blue colour to the cells under the Neurons, Astro, Oligo, Micro and Endothelia columns. I would want the shade of blue to get darker the larger the value. I have tried everything but it just isn't working. Also, where can I see what colours/shades are available? Thank you in advance!
The table generated
the code I used:
# read in the subject metadata file that has sample/experimental labels
decon_CIB_IP <- read.csv("CIBERSORT_IP_dataset.csv", row.names=NULL, fill=TRUE)
gt_tbl <- gt(decon_CIB_IP)
# Generate a simple table with a stub
# and add a stubhead label
gt_tbl <-
gt_tbl %>%
tab_stubhead(label = "Subject")
# put a heading just above the column labels
gt_tbl <-
gt_tbl %>%
tab_header(
title = "Deconvolution against IP dataset using CIBERSORT v1.04 method"
)
gt_tbl <-
gt_tbl %>%
tab_source_note(
source_note = "Dataset provided by Zhang et al. 201641 and performed using BrainDeconvShiny"
)
# add another column title
gt_tbl <-
gt_tbl %>%
tab_spanner(
label = "Cell Type",
columns = c(Neurons, Astrocytes, Oligodendrocytes, Microglia, Endothelia)
)
# Show the gt Table
gt_tbl

How to map discrete colours in a Plotly Sunburst chart in r

I am very new to using plotly in rstudio and have come up against a problem with mapping discrete colours (stored as hex codes in the field color) to each of the slices in my ids field.
I have included my code below:
df %>%
plot_ly(
color = I("black"),
marker = list(colors = ~color)) %>%
add_trace(ids = df$ids,
labels = df$labels,
parents = df$parents,
type = 'sunburst',
maxdepth = -1,
domain = list(column = 0)) %>%
layout(sunburstcolorway = df$color)
This is the resulting sunburst diagram I get using this code, which is obviously not ideal:
Ideally the first four levels would have the same colour, and then different hex colour codes are used for slices that are labelled "Poor","Moderate","GwC" or "Good".
A csv file of my data frame used above is available here.
I finally managed to nut out how to map my colour field to the background colours on the sunburst chart - have updated the code in original post. All that was required was to insert the following code segment:
plot_ly(
marker = list(colors = ~color))
Below is the output chart:

How to include "or" in a for loop in Python

I'm trying my first little web scraping attempts with python and I have come across the following problem:
for resultat in tr.find_all(class_='tc fs-17 white bg-darkgrey p-r' or class_='tc fs-9 white bg-red mb-2 lh-data'):
    data.append(resultat.text)
I need the for loop to append the text to data when an element matches either of the two class_ values. However, I have no clue how to do it.
A little help would be appreciated.
Regards,
For simplicity, you can separate the loop.
for resultat in tr.find_all(class_='tc fs-17 white bg-darkgrey p-r'):
    data.append(resultat.text)
for resultat in tr.find_all(class_='tc fs-9 white bg-red mb-2 lh-data'):
    data.append(resultat.text)
If you want to use one loop, first combine the two result lists into one. Note that .extend modifies a list in place and returns None, so its result should not be assigned; concatenating with + returns a new list.
background_darkgrey = tr.find_all(class_='tc fs-17 white bg-darkgrey p-r')
background_red = tr.find_all(class_='tc fs-9 white bg-red mb-2 lh-data')
elements_to_scrape = background_darkgrey + background_red
After that, just iterate over elements_to_scrape.
for resultat in elements_to_scrape:
    data.append(resultat.text)
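The difference between extending in place and concatenating can be checked without a web page. The following is only a minimal sketch: plain Python lists stand in for the two find_all results, and the string values are made-up placeholders rather than data from any real site.

```python
# Plain lists stand in for the two find_all(...) results; the strings are
# hypothetical placeholders, not data from the real page.
background_darkgrey = ["3-1", "2-0"]
background_red = ["0-1"]

# list.extend mutates the list in place and returns None,
# so its return value should not be assigned to a variable.
combined_in_place = list(background_darkgrey)  # copy, so the original stays untouched
result_of_extend = combined_in_place.extend(background_red)

# The + operator builds and returns a new list instead.
elements_to_scrape = background_darkgrey + background_red

# A single loop over the combined list collects everything.
data = []
for resultat in elements_to_scrape:
    data.append(resultat)
```

Either approach gives the same combined list; the only pitfall is assigning the return value of extend, which is always None.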

Finding SVG Elements using RSelenium and XPath

I am writing a dynamic web scraper for a private website, where I am scraping the results of different bar charts available for sports teams. The problem is that the team's city name is the text available in the span element with class 'bar-chart__value', but there are duplicates (e.g. New York, Los Angeles).
It seems the only place where unique values are available for the team name is in the svg element, but I can't figure out how to find the svg element using xpath.
leaderboard <- map_dfr(1:length(a), function(x){
team <- remDr$findElements("xpath", "//span[@class = 'bar-chart__logo']/*[name() = 'svg']")[[x]]$getElementText()
if(team == "Average") {
number <- remDr$findElements("xpath", "//span[@class = 'bar-chart__value']")[[x]]$getElementText()
avg <<- x
} else if(x > avg){
number <- remDr$findElements("xpath", "//span[@class = 'bar-chart__value']
//span[@class = 'play-link__number-span']")[[x-1]]$getElementText()
} else {
number <- remDr$findElements("xpath", "//span[@class = 'bar-chart__value']
//span[@class = 'play-link__number-span']")[[x]]$getElementText()
}
df <- tibble(unlist(team), unlist(number))
colnames(df) <- c("Team", specific)
return(df)
})
Does anyone know how to use xpath in the findElement method to find an svg element? This is the code returning the error:
remDr$findElements("xpath", "//span[@class = 'bar-chart__logo']/*[name() = 'svg']")[[1]]$getElementText()
Use local-name() instead of name(), because name() might include the namespace prefix.
remDr$findElements("xpath", "//span[@class = 'bar-chart__logo']/*[local-name() = 'svg']")[[x]]$getElementText()

two textplots in one plot

I have been trying to work with textplot in R and am unsure whether what I want is possible. As far as I know, par() can't be used to place two textplots in one plot. I have been using a page and this code to try and figure things out.
My question is: Is it possible to have two textplots within the same plot?
For example, in the par(mfrow=c(1,1)) scenario below, plot 1 is a textplot of sepal length by species. Say I wanted to replicate that textplot twice in that plot. Is that possible?
based on this site:
http://svitsrv25.epfl.ch/R-doc/library/gplots/html/textplot.html
textplot(version)
data(iris)
par(mfrow=c(1,1))
info <- sapply( split(iris$Sepal.Length, iris$Species),
function(x) round(c(Mean=mean(x), SD=sd(x), N=gdata::nobs(x)),2) )
textplot( info, valign="top" )
title("Sepal Length by Species")
What I want to do is put a second textplot within that plot, underneath the original. For argument's sake, replicating that textplot twice in the plot.
Is this possible?
Thanks!
Maybe you've figured it out in the last four months but I thought I'd chip in an answer anyway.
The code provided is most of the way towards doing what you require already; you just have to provide some additional arguments to title() and par(). To place the title above both plots, use title("your title", outer = TRUE). To adjust the position of the title, reserve an outer margin at the top with par(mfrow = c(2,1), oma = c(0,0,top,0)), where top is a numeric value of your choosing. Hopefully this answers your question.
require('gplots')
data(iris)
info <- sapply(split(iris$Sepal.Length, iris$Species),
function(x) round(c(Mean = mean(x), SD = sd(x), N = gdata::nobs(x)),2))
## The third value of oma is the top outer margin (in lines of text). The 2 here is
## only an example value; change it to control the position of the title with
## respect to the top of the page.
par(mfrow = c(2,1), oma = c(0, 0, 2, 0))
textplot(info, valign = "top")
textplot(info, valign = "top")
title("Sepal Length by Species", outer = TRUE)
