How to separate html_text results using rvest?

I am trying to scrape information from this Google Scholar web page:
https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science
library(rvest)
htmlfile <- "https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science"
g_interest <- read_html(htmlfile) %>% html_nodes("div.gsc_oai_int") %>% html_text()
I got the following result:
[1] "Quantum Chemistry Electronic Structure Condensed Matter Physics Materials Science Nanotechnology "
[2] "density functional theory first principles calculations many body theory condensed matter physics materials science "
[3] "chemistry materials science physics nanotechnology "
[4] "Materials Science Nanotechnology Chemistry Physics "
[5] "Physics Theoretical Physics Condensed Matter Theory Materials Science Nanoscience "
[6] "Materials Science Quantum Chemistry Fiber Optic Sensors Geophysics "
[7] "Chemical Physics Condensed Matter Materials Science Magnetic Properties NMR "
[8] "Materials Science "
[9] "Materials Science Physics "
[10] "Physics Materials Science Theoretical Physics Nanoscience "
However, I would like the results to look like this:
[1] "Quantum Chemistry; Electronic Structure; Condensed Matter Physics; Materials Science; Nanotechnology"
...
Any suggestions on how to separate the entries with ";"?

You can make use of the purrr and stringr packages: extract the parent nodes first, then select and concatenate the individual interest entries within each one.
library(rvest)
library(purrr)
library(stringr)
htmlfile <- "https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science"
# One container node per author, holding that author's list of interests
content_nodes <- read_html(htmlfile) %>% html_nodes("div.gsc_oai_int")
# Within each container, extract the individual interest nodes and join them with ";"
map_chr(content_nodes, ~ .x %>%
  html_nodes(".gsc_oai_one_int") %>%
  html_text() %>%
  str_c(collapse = ";"))
Result:
[1] "Quantum Chemistry;Electronic Structure;Condensed Matter Physics;Materials Science;Nanotechnology"
[2] "density functional theory;first principles calculations;many body theory;condensed matter physics;materials science"
[3] "chemistry;materials science;physics;nanotechnology"
[4] "Materials Science;Nanotechnology;Chemistry;Physics"
[5] "Physics;Theoretical Physics;Condensed Matter Theory;Materials Science;Nanoscience"
[6] "Materials Science;Quantum Chemistry;Fiber Optic Sensors;Geophysics"
[7] "Chemical Physics;Condensed Matter;Materials Science;Magnetic Properties;NMR"
[8] "Materials Science"
[9] "Materials Science;Physics"
[10] "Physics;Materials Science;Theoretical Physics;Nanoscience"

Related

Is Delta E a reliable guide to the difference between colors?

I'm attempting to order some color swatches by "maximum difference" and am getting some odd results. For example, colors 5 and 14 in the palette below appear, at least to my (non-colorblind) eye, rather similar, much more so than some of the colors that follow them, yet they have a higher minimum ΔE (calculated against all the previous colors) than many of those later colors.
Is ΔE considered a reliable way of calculating a "perceptual distance" between colors, or should I be using something else?
In order, the colors shown here are:
> x.hex
[1] "#060186" "#8EF905" "#F5C5F7" "#805200" "#0DE0FE" "#D0135D" "#0B834F" "#FABF74" "#417BEA" "#FA4D01"
[11] "#DC39FC" "#590708" "#919913" "#01DDAE" "#068896" "#D28B8B" "#7C4C8E" "#A3BCE7" "#0C5378" "#F1E11E"
[21] "#A24731" "#495C0D" "#01B633" "#4A30FE" "#BB7D0A" "#680F41" "#C1D597" "#FC75C1" "#A786C7" "#29A4DD"
[31] "#FD0A3D" "#43A99B" "#B16A8D" "#D002A2" "#BA7755" "#FECBB6" "#253467" "#FF9143" "#8A763A" "#5960A6"
[41] "#B79D66" "#70A271"
And the minimum ΔE against the previous colors:
> DeList
[1] 117.25473 69.53788 55.00019 46.90173 38.54371 37.20359 36.32807 35.23608 28.57360 27.10889
[11] 26.77178 25.24130 24.39958 24.24133 22.51525 22.23315 20.50791 19.93881 19.63842 19.45253
[21] 19.31200 19.04087 18.90102 18.64973 18.25806 18.08846 17.55115 17.19687 16.82420 15.35578
[31] 15.17018 14.95605 14.77414 14.67706 14.67703 14.37527 14.16665 14.02716 14.00375 13.90574
[41] 13.84133
I'm calculating ΔE using the R package spacesXYZ:
spacesXYZ::DeltaE( lab.matrix[i,], lab.matrix[j,], 2000 )
and calculating the hex code from the LAB matrix using:
x <- lab.matrix[pal.list,] # extract the LAB numbers from the matrix
x.lab <- colorspace::LAB(x) # convert to an LAB colorspace class
(x.hex <- colorspace::hex(x.lab)) # convert to hex string

Is there a method to detect a person and associate text with them?

I have a text like:
Take a look at some of the first confirmed Forum speakers: John
Sequiera Graduated in Biology at Facultad de Ciencias Exactas y
Naturales,University of Buenos Aires, Argentina. In 2004 obtained a
PhD in Biology (Molecular Neuroscience), at University of Buenos
Aires, mentored by Prof. Marcelo Rubinstein. Between 2005 and 2008
pursued postdoctoral training at Pasteur Institute (Paris) mentored by
Prof Jean-Pierre Changeux, to investigate the role of nicotinic
receptors in executive behaviors. Motivated by a deep interest in
investigating human neurological diseases, in 2009 joined the
Institute of Psychiatry at King’s College London where she performed
basic research with a translational perspective in the field of
neurodegeneration.
Since 2016 has been chief of instructors / Adjunct professor at University of Buenos Aires, Facultad de Ciencias Exactas y Naturales.
Tom Gonzalez is a professor of Neuroscience at the Sussex Neuroscience, School of Life Sciences, University of Sussex. Prof.
Baden studies how neurons and networks compute, using the beautiful
collection of circuits that make up the vertebrate retina as a model.
I want to have as output:
[{"person" : "John Sequiera" , "content": "Graduated in Biology at Facultad...."},{"person" : "Tom Gonzalez" , "content": "is a professor of Neuroscience at the Sussex..."}]
So we want NER to give us PER for each person, and as content we take everything after the detected person until a new person is found in the text.
Is that possible?
I tried to use spaCy to extract the named entities, but I have difficulty getting the content:
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
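One way to get the content as well is to use the character offsets that spaCy stores on each entity. Here is a minimal sketch, assuming text holds the passage above and that the speaker names are tagged as PERSON; note that other names in the text (such as mentors) may also be tagged PERSON and would need extra filtering:
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp(text)

# Keep only PERSON entities, in document order
persons = [ent for ent in doc.ents if ent.label_ == "PERSON"]

# Each person's content runs from the end of their name
# to the start of the next person's name (or the end of the text)
records = []
for i, ent in enumerate(persons):
    end = persons[i + 1].start_char if i + 1 < len(persons) else len(doc.text)
    records.append({"person": ent.text,
                    "content": doc.text[ent.end_char:end].strip()})

print(records)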

In Python, headings not in the same row

I extracted three columns from a larger data frame (recent_grads) as follows...
df = recent_grads.groupby('Major_category')[['Men', 'Women']].sum()
However, when I print df, it comes up as follows...
Men Women
Major_category
Agriculture & Natural Resources 40357.0 35263.0
Arts 134390.0 222740.0
Biology & Life Science 184919.0 268943.0
Business 667852.0 634524.0
Communications & Journalism 131921.0 260680.0
Computers & Mathematics 208725.0 90283.0
Education 103526.0 455603.0
Engineering 408307.0 129276.0
Health 75517.0 387713.0
Humanities & Liberal Arts 272846.0 440622.0
Industrial Arts & Consumer Services 103781.0 126011.0
Interdisciplinary 2817.0 9479.0
Law & Public Policy 91129.0 87978.0
Physical Sciences 95390.0 90089.0
Psychology & Social Work 98115.0 382892.0
Social Science 256834.0 273132.0
How do I get Major_category heading in the same row as Men and Women headings? I tried to put the three columns in a new data frame as follows...
df1 = df[['Major_category', 'Men', 'Women']].copy()
This gives me an error ("Major_category not in index").
You should try reset_index (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html), which moves Major_category from the index back into a regular column:
df = recent_grads.groupby('Major_category')[['Men', 'Women']].sum()
# Turn the Major_category index into a column and print the output
md = df.reset_index()
print(md)
Seems like you want Major_category as a regular column rather than as the index of the grouped result. You can also avoid the problem at the source by passing as_index=False to groupby:
df = recent_grads.groupby('Major_category', as_index=False)[['Men', 'Women']].sum()

Unable to get rid of all emojis

I need help removing emojis. I looked at some other Stack Overflow questions and this is what I am doing, but for some reason my code doesn't get rid of all the emojis:
d = {'alexveachfashion': 'Fashion Style * Haute Couture * Wearable Tech * VR\n👓👜⌚👠\nSoundCloud is Live #alexveach\n👇New YouTube Episodes ▶️👇', 'andrewvng': 'Family | Fitness | Friends | Gym | Food', 'runvi.official': 'Accurate measurement via SMART insoles & real-time AI coaching. Improve your technique & BOOST your performance with every run.\nSoon on Kickstarter!', 'triing': 'Augmented Jewellery™️ • Montreal. Canada.', 'gedeanekenshima': 'Prof na Etec Albert Einstein, Mestranda em Automação e Controle de Processos, Engenheira de Controle e Automação, Técnica em Automação Industrial.', 'jetyourdaddy': '', 'lavonne_sun': '☄️🔭 ✨\n°●°。Visual Narrative\nA creative heart with a poetic soul.\n————————————\nPARSONS —— Design & Technology', 'taysearch': 'All the World’s Information At Your Fingertips. (Literally) Est. 1991🇺🇸 🎀#PrincessofSearch 🔎Sample 👇🏽 the Search Engine Here 🗽', 'hijewellery': 'Fine 3D printed jewellery for tech lovers #3dprintedjewelry #wearabletech #jewellery', 'yhanchristian': 'Estudante de Engenharia, Maker e viciado em café.', 'femka': 'Fashion Futurist + Fashion Tech Lab Founder #technoirlab + Fashion Designer / Parsons & CSM Grad / Obsessed with #fashiontech #future #cryptocurrency', 'sinhbisen': 'Creator, TRiiNG, augmented jewellery label ⭕️ Transhumanist ⭕️ Corporeal cartographer ⭕️', 'stellawearables': '#StellaWearables ✉️Info#StellaWearables.com Premium Wearable Technology That Monitors Personal Health & Environments ☀️🏝🏜🏔', 'ivoomi_india': 'We are the manufacturers of the most innovative technologies and user-friendly gadgets with a global presence.', 'bgutenschwager': "When it comes to life, it's all about the experience.\nGoogle Mapper 🗺\n360 Photographer 📷\nBrand Rep #QuickTutor", 'storiesofdesign': 'Putting stories at the heart of brands and businesses | Cornwall and London | #storiesofdesign', 'trume.jp': '草創期から国産ウオッチの製造に取り組み、挑戦を続けてきたエプソンが世界に放つ新ブランド「TRUME」(トゥルーム)。目指すのは、最先端技術でアナログウオッチを極めるブランド。', 'themarinesss': "I didn't choose the blog life, the blog life chose me | Aspiring Children's Book Author | www.slayathomemum.com", 'ayowearable': 'The world’s first light-based wearable that helps you sleep better, beat jet lag and have more energy! #goAYO Get yours at:', 'wearyourowntechs': 'Bringing you the latest trends, Current Products and Reviews of Wearable Technology. Discover how they can enhance your Life and Lifestyle', 'roxfordwatches': 'The Roxford | The most stylish and customizable fitness smartwatch. Tracks your steps/calories/dist/sleep. Comes with FOUR bands, and a travel case!', 'playertek': "Track your entire performance - every training session, every match. \nBecause the best players don't hide.", '_kate_hartman_': '', 'hmsmc10': 'Health & Wellness 🍎\nBoston, MA 🏙\nSuffolk MPA ‘17 🎓 \n.\nJust Strong Ambassador 🏋🏻\u200d♀️', 'gadgetxtreme': 'Dedicated to reviewing gadgets, technologies, internet products and breaking tech news. Follow us to see daily vblogs on all the disruptive tech..', 'freedom.journey.leader': '📍MN\n🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿 \n📧Ashleybp5#gmail.com \n#homeschool #bossmom #builder #momlife', 'arts_food_life': 'Life through my phone.', 'medgizmo': 'Wearable #tech: #health #healthcare #wellness #gadgets #apps. \nImages/links provided as information resource only; doesn’t mean we endorse referenced', 'sawearables': 'The home of wearable tech in South Africa!\n--> #WearableTech #WearableTechnology #FitnessTech Find your wearable #', 'shop.mercury': 'Changing the way you charge.⚡️\nGet exclusive product discounts, and help us reach our goal below!🔋', 'invisawear': 'PRE-ORDERS NOW AVAILABLE! Get yours 25% OFF here: #girlboss #wearabletech'}
for key in d:
    print("---with emojis----")
    print(d[key])
    print("---emojis removed----")
    # Keep only characters in the Basic Multilingual Plane (code points <= U+FFFF)
    x = ''.join(c for c in d[key] if c <= '\uFFFF')
    print(x)
Output example:
---with emojis----
📍MN
🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿
📧Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---emojis removed----
MN
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---with emojis----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!🔋
---emojis removed----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!
There is no technical definition of what an "emoji" is. Various glyphs may be used to render printable characters, symbols, control characters and the like. What seems like an "emoji" to you may be part of normal script to others. Note that your current filter only drops characters above U+FFFF, and ⚡ (U+26A1) lies below that cutoff, which is why it survives.
What you probably want to do is look at the Unicode category of each character and filter out various categories. While this does not solve the "emoji" definition problem per se, you get much better control over what you are actually doing, without removing, for example, literally all characters of the languages spoken by two thirds of the planet.
Instead of filtering out certain categories, you may filter everything except lower- and uppercase letters (and numbers); a sketch of that whitelist approach is at the end of this answer. However, be aware that ꙭ is not "the googly eyes emoji" but CYRILLIC SMALL LETTER DOUBLE MONOCULAR O, which is a normal lowercase letter to millions of people.
For example:
import unicodedata
s = "🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿"
# Just filter category "symbol"
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', ))
print(t)
...results in
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
This may not be emoji-free enough for you, since the • is technically a form of punctuation. So filter that as well:
# Filter symbols and punctuations. You may want 'Cc' as well,
# to get rid of control characters. Beware that newlines are a
# form of control-character.
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', 'Po'))
print(t)
And you get
Wife Homeschooling Mom to 5 D Y I lover Small town living in MN
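And for the whitelist approach mentioned above, here is a minimal sketch that keeps only letters, numbers and whitespace by checking the first character of each category code. Be aware that this also drops legitimate punctuation such as the period, and newlines (category Cc), so extend the whitelist to taste:
import unicodedata

s = "🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿"

# Keep letters (L*), numbers (N*) and separators such as spaces (Z*)
t = ''.join(c for c in s if unicodedata.category(c)[0] in ('L', 'N', 'Z'))
print(t)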

Isolate section of long character string in R

I have locality information in a vector that contains [lat.];[long.];[town name];[governorate name];[country name].
I want to create a new vector that includes just the town name for each observation.
Below you can see the contents of the vector for the first four observations.
[1] 36.7416894818782;10.227453200912464;Ben Arous;Gouvernorat de Ben Arous;TN;
[2] 37.17652020713887;9.784534661865223;Tinjah;Gouvernorat de Bizerte;TN;
[3] 34.7313;10.763400000000047;Sfax;Sfax;TN;
[4] 34.829474860751915;9.791573033378995;Regueb;Gouvernorat de Sidi Bouzid;TN;
I want to output a vector that looks like:
[1] Ben Arous
[2] Tinjah
[3] Sfax
[4] Regueb
You could use read.table with sep=";":
d <- read.table(textConnection("
36.7416894818782;10.227453200912464;Ben Arous;Gouvernorat de Ben Arous;TN;
37.17652020713887;9.784534661865223;Tinjah;Gouvernorat de Bizerte;TN;
34.7313;10.763400000000047;Sfax;Sfax;TN;
34.829474860751915;9.791573033378995;Regueb;Gouvernorat de Sidi Bouzid;TN;"),
sep=";", stringsAsFactors=FALSE)
d[, 3]
# [1] "Ben Arous" "Tinjah" "Sfax" "Regueb"
